
Ophiuchus: Scalable Modeling of Protein Structures through Hierarchical Coarse-graining SO(3)-Equivariant Autoencoders

Allan dos Santos Costa
Center for Bits and Atoms, MIT Media Lab, Molecular Machines
allanc@mit.edu

Ilan Mitnikov
Center for Bits and Atoms, MIT Media Lab, Molecular Machines

Mario Geiger
Atomic Architects, MIT Research Laboratory of Electronics

Manvitha Ponnapati
Center for Bits and Atoms, MIT Media Lab, Molecular Machines

Tess Smidt
Atomic Architects, MIT Research Laboratory of Electronics

Joseph Jacobson
Center for Bits and Atoms, MIT Media Lab, Molecular Machines
Abstract

Three-dimensional native states of natural proteins display recurring and hierarchical patterns. Yet, traditional graph-based modeling of protein structures is often limited to operating within a single fine-grained resolution, and lacks hourglass neural architectures capable of learning those high-level building blocks. We narrow this gap by introducing Ophiuchus, an SO(3)-equivariant coarse-graining model that efficiently operates on all-atom protein structures. Our model departs from current approaches that employ graph modeling, instead focusing on local convolutional coarsening to model sequence-motif interactions with time complexity that is efficient in protein length. We measure the reconstruction capabilities of Ophiuchus across different compression rates and compare it to existing models. We examine the learned latent space and demonstrate its utility through conformational interpolation. Finally, we leverage denoising diffusion probabilistic models (DDPM) in the latent space to efficiently sample protein structures. Our experiments demonstrate that Ophiuchus is a scalable basis for efficient protein modeling and generation.

1 Introduction

Proteins form the basis of all biological processes and understanding them is critical to biological discovery, medical research and drug development. Their three-dimensional structures often display modular organization across multiple scales, making them promising candidates for modeling in motif-based design spaces [Bystroff & Baker (1998); Mackenzie & Grigoryan (2017); Swanson et al. (2022)]. Harnessing these coarser, lower-frequency building blocks is of great relevance to the investigation of the mechanisms behind protein evolution, folding and dynamics [Mackenzie et al. (2016)], and may be instrumental in enabling more efficient computation on protein structural data through coarse and latent variable modeling [Kmiecik et al. (2016); Ramaswamy et al. (2021)].

Recent developments in deep learning architectures applied to protein sequences and structures demonstrate the remarkable capabilities of neural models in the domain of protein modeling and design [Jumper et al. (2021); Baek et al. (2021b); Ingraham et al. (2022); Watson et al. (2022)]. Still, current state-of-the-art architectures lack the structure and mechanisms to directly learn and operate on modular protein blocks.

To fill this gap, we introduce Ophiuchus, a deep SO(3)-equivariant model that captures joint encodings of sequence-structure motifs of all-atom protein structures. Our model is a novel autoencoder that uses one-dimensional sequence convolutions on geometric features to learn coarsened representations of proteins. Ophiuchus outperforms existing SO(3)-equivariant autoencoders [Fu et al. (2023)] on the protein reconstruction task. We present extensive ablations of model performance across different autoencoder layouts and compression settings. We demonstrate that our model learns a robust and structured representation of protein structures by learning a denoising diffusion probabilistic model (DDPM) [Ho et al. (2020)] in the latent space. We find Ophiuchus to enable significantly faster sampling of protein structures, as compared to existing diffusion models [Wu et al. (2022a); Yim et al. (2023); Watson et al. (2023)], while producing unconditional samples of comparable quality and diversity.

Figure 1: Coarsening a Three-Dimensional Sequence. (a) Each residue is represented with its $\mathbf{C}_{\alpha}$ atom position and geometric features that encode its label and the positions of its other atoms. ($\ast$) Side-chains with non-orderable atoms are encoded in a permutation-invariant way. (b) Our proposed model uses roto-translation equivariant convolutions to coarsen these positions and geometric features. (c) We train deep autoencoders to reconstruct all-atom protein structures directly in three dimensions.

Our main contributions are summarized as follows:

  • Novel Autoencoder: We introduce a novel SO(3)-equivariant autoencoder for protein sequence and all-atom structure representation. We propose novel learning algorithms for coarsening and refining protein representations, leveraging irreducible representations of SO(3) to efficiently model geometric information. We demonstrate the power of our latent space through unsupervised clustering and latent interpolation.

  • Extensive Ablation: We offer an in-depth examination of our architecture through extensive ablation across different protein lengths, coarsening resolutions and model sizes. We study the trade-off of producing a coarsened representation of a protein at different resolutions and the recoverability of its sequence and structure.

  • Latent Diffusion: We explore a novel generative approach to proteins by performing latent diffusion on geometric feature representations. We train diffusion models for multiple resolutions, and provide diverse benchmarks to assess sample quality. To the best of our knowledge, this is the first generative model to directly produce all-atom structures of proteins.

2 Background and Related Work

2.1 Modularity and Hierarchy in Proteins

Protein sequences and structures display significant degrees of modularity. [Vallat et al. (2015)] introduces a library of common super-secondary structural motifs (Smotifs), while [Mackenzie et al. (2016)] shows protein structural space to be efficiently describable by small tertiary alphabets (TERMs). Motif-based methods have been successfully used in protein folding and design [Bystroff & Baker (1998); Li et al. (2022)]. Inspired by this hierarchical nature of proteins, our proposed model learns coarse-grained representations of protein structures.

2.2 Symmetries in Neural Architecture for Biomolecules

Learning algorithms greatly benefit from proactively exploiting symmetry structures present in their data domain [Bronstein et al. (2021); Smidt (2021)]. In this work, we investigate three relevant symmetries for the domain of protein structures:

Euclidean Equivariance of Coordinates and Feature Representations. Neural models equipped with roto-translational (Euclidean) invariance or equivariance have been shown to outperform competitors in molecular and point cloud tasks [Townshend et al. (2022); Miller et al. (2020); Deng et al. (2021)]. Similar results have been extensively reported across different structural tasks of protein modeling [Liu et al. (2022); Jing et al. (2021)]. Our proposed model takes advantage of Euclidean equivariance both in processing of coordinates and in its internal feature representations, which are composed of scalars and higher order geometric tensors [Thomas et al. (2018); Weiler et al. (2018)].

Translation Equivariance of Sequence. One-dimensional Convolutional Neural Networks (CNNs) have been demonstrated to successfully model protein sequences across a variety of tasks [Karydis (2017); Hou et al. (2018); Lee et al. (2019); Yang et al. (2022)]. These models capture sequence-motif representations that are equivariant to translation of the sequence. However, sequential convolution is less common in architectures for protein structures, which are often cast as Graph Neural Networks (GNNs) [Zhang et al. (2021)]. Notably, [Fan et al. (2022)] proposes a CNN network to model the regularity of one-dimensional sequences along with three-dimensional structures, but they restrict their layout to coarsening. In this work, we further integrate geometry into sequence by directly using three-dimensional vector feature representations and transformations in 1D convolutions. We use this CNN to investigate an autoencoding approach to protein structures.

Permutation Invariances of Atomic Order. In order to capture the permutable ordering of atoms, neural models of molecules are often implemented with permutation-invariant GNNs [Wieder et al. (2020)]. Nevertheless, protein structures are sequentially ordered, and most standard side-chain heavy atoms are readily orderable, with the exception of four residues [Jumper et al. (2021)]. We use this fact to design an efficient approach to directly model all-atom protein structures, introducing a method to parse atomic positions in parallel channels as roto-translational equivariant feature representations.

2.3 Unsupervised Learning of Proteins

Unsupervised techniques for capturing protein sequence and structure have witnessed remarkable advancements in recent years [Lin et al. (2023); Elnaggar et al. (2023); Zhang et al. (2022)]. Amongst unsupervised methods, autoencoder models learn to produce efficient low-dimensional representations using an informational bottleneck. These models have been successfully deployed to diverse protein tasks of modeling and sampling [Eguchi et al. (2020); Lin et al. (2021); Wang et al. (2022); Mansoor et al. (2023); Visani et al. (2023)], and have received renewed attention for enabling the learning of coarse representations of molecules [Wang & Gómez-Bombarelli (2019); Yang & Gómez-Bombarelli (2023); Winter et al. (2021); Wehmeyer & Noé (2018); Ramaswamy et al. (2021)]. However, existing three-dimensional autoencoders for proteins do not have the structure or mechanisms to explore the extent to which coarsening is possible in proteins. In this work, we fill this gap with extensive experiments on an autoencoder for deep protein representation coarsening.

2.4 Denoising Diffusion for Proteins

Denoising Diffusion Probabilistic Models (DDPM) [Sohl-Dickstein et al. (2015); Ho et al. (2020)] have found widespread adoption through diverse architectures for generative sampling of protein structures. Chroma [Ingraham et al. (2022)] trains random graph message passing through roto-translational invariant features, while RFDiffusion [Watson et al. (2022)] fine-tunes the pretrained folding model RoseTTAFold [Baek et al. (2021a)] for denoising, employing SE(3)-equivariant transformers in structural decoding [Fuchs et al. (2020)]. [Yim et al. (2023); Anand & Achim (2022)] generalize denoising diffusion to frames of reference, employing Invariant Point Attention [Jumper et al. (2021)] to model three-dimensional structure, while FoldingDiff [Wu et al. (2022a)] explores denoising in angular space. More recently, [Fu et al. (2023)] proposes a latent diffusion model on coarsened representations learned through Equivariant Graph Neural Networks (EGNN) [Satorras et al. (2022)]. In contrast, our model uses roto-translation equivariant features to produce increasingly richer structural representations from autoencoded sequence and coordinates. We propose a novel latent diffusion model that samples directly in this space for generating protein structures.

3 The Ophiuchus Architecture

We represent a protein as a sequence of $N$ residues, each with an anchor position $\mathbf{P}\in\mathbb{R}^{1\times 3}$ and a tensor of irreducible representations of SO(3), $\mathbf{V}^{0:l_{\max}}$, where $\mathbf{V}^{l}\in\mathbb{R}^{d\times(2l+1)}$ for degree $l\in[0,l_{\max}]$. A residue state is defined as $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$. These representations are directly produced from sequence labels and all-atom positions, as we describe in the Atom Encoder/Decoder sections. To capture the diverse interactions within a protein, we propose three main components. Self-Interaction learns geometric representations of each residue independently, modeling local interactions of atoms within a single residue. Sequence Convolution simultaneously updates sequential segments of residues, modeling inter-residue interactions between sequence neighbors. Finally, Spatial Convolution employs message-passing of geometric features to model interactions of residues that are nearby in 3D space. We compose these three modules to build an hourglass architecture.

Figure 2: Building Blocks of Ophiuchus: (a) Atom Encoder and (b) Atom Decoder enable the model to directly take and produce atomic coordinates. (c) Self-Interaction updates representations internally, across different vector orders $l$. (d) Spatial Convolution interacts spatial neighbors. (e) Sequence Convolution and (f) Transpose Sequence Convolution communicate sequence neighbors and produce coarser and finer representations, respectively. (g) Hourglass model: we compose these modules to build encoder and decoder models, stacking them into an autoencoder.

3.1 All-Atom Encoder and Decoder

Given a particular $i$-th residue, let $\mathbf{R}_{i}\in\mathcal{R}$ denote its residue label, $\mathbf{P}_{i}^{\alpha}\in\mathbb{R}^{1\times 3}$ denote the global position of its alpha carbon ($\mathbf{C}_{\alpha}$), and $\mathbf{P}^{\ast}_{i}\in\mathbb{R}^{n\times 3}$ the positions of all $n$ other atoms relative to $\mathbf{C}_{\alpha}$. We produce initial residue representations $(\mathbf{P},\mathbf{V}^{0:l_{\max}})_{i}$ by setting anchors $\mathbf{P}_{i}=\mathbf{P}^{\alpha}_{i}$, scalars $\mathbf{V}_{i}^{l=0}=\textrm{Embed}(\mathbf{R}_{i})$, and geometric vectors $\mathbf{V}_{i}^{l>0}$ to explicitly encode the relative atomic positions $\mathbf{P}^{\ast}_{i}$.

In particular, provided the residue label $\mathbf{R}$, the heavy atoms of most standard protein residues are readily put in a canonical order, enabling direct treatment of atom positions as a stack of signals on SO(3): $\mathbf{V}^{l=1}=\mathbf{P}^{*}$. However, some of the standard residues present pairwise permutation symmetries within their atom configurations, in which pairs of atoms have ordering indices ($v$, $u$) that may be exchanged (Appendix A.1). To handle these cases, we instead use geometric vectors to encode the center $\mathbf{V}^{l=1}_{\textrm{center}}=\frac{1}{2}(\mathbf{P}^{*}_{v}+\mathbf{P}^{*}_{u})$ and the unsigned difference $\mathbf{V}^{l=2}_{\textrm{diff}}=Y_{2}\big(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\big)$ between the positions of the pair, where $Y_{2}$ is a spherical harmonics projector of degree $l=2$. This signal is invariant to the corresponding atomic two-flips, while still directly carrying information about positioning and angularity. To invert this encoding, we decode the degree $l=2$ signal back into two arbitrarily ordered vectors of degree $l=1$. Please refer to Appendix A.2 for further details.

Input: $\mathbf{C}_{\alpha}$ Position $\mathbf{P}^{\alpha}\in\mathbb{R}^{1\times 3}$
Input: All-Atom Relative Positions $\mathbf{P}^{\ast}\in\mathbb{R}^{n\times 3}$
Input: Residue Label $\mathbf{R}\in\mathcal{R}$
Output: Latent Representation $(\mathbf{P},\mathbf{V}^{l=0:2})$
$\mathbf{P}\leftarrow\mathbf{P}^{\alpha}$
$\mathbf{V}^{l=0}\leftarrow\textrm{Embed}(\mathbf{R})$
$\mathbf{V}^{l=1}_{\textrm{ordered}}\leftarrow\textrm{GetOrderablePositions}(\mathbf{R},\mathbf{P}^{\ast})$
$\mathbf{P}^{*}_{v},\mathbf{P}^{*}_{u}\leftarrow\textrm{GetUnorderablePositionPairs}(\mathbf{R},\mathbf{P}^{\ast})$
$\mathbf{V}^{l=1}_{\textrm{center}}\leftarrow\frac{1}{2}(\mathbf{P}^{*}_{v}+\mathbf{P}^{*}_{u})$
$\mathbf{V}^{l=2}_{\textrm{diff}}\leftarrow Y_{2}\big(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\big)$
$\mathbf{V}^{l=0:2}\leftarrow\mathbf{V}^{l=1}_{\textrm{ordered}}\oplus\mathbf{V}^{l=1}_{\textrm{center}}\oplus\mathbf{V}^{l=2}_{\textrm{diff}}$
return $(\mathbf{P},\mathbf{V}^{l=0:2})$
Algorithm 1: All-Atom Encoding
Input: Latent Representation $(\mathbf{P},\mathbf{V}^{l=0:2})$
Output: $\mathbf{C}_{\alpha}$ Position $\mathbf{P}^{\alpha}\in\mathbb{R}^{1\times 3}$
Output: All-Atom Relative Positions $\mathbf{P}^{\ast}\in\mathbb{R}^{|\mathcal{R}|\times n\times 3}$
Output: Residue Label Logits $\bm{\ell}\in\mathbb{R}^{|\mathcal{R}|}$
$\mathbf{P}^{\alpha}\leftarrow\mathbf{P}$
$\bm{\ell}\leftarrow\textrm{LogSoftmax}\big(\textrm{Linear}(\mathbf{V}^{l=0})\big)$
$\hat{\mathbf{V}}^{l=1}_{\textrm{ordered}}\leftarrow\textrm{Linear}(\mathbf{V}^{l=1})$
$\hat{\mathbf{V}}^{l=1}_{\textrm{center}},\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}\leftarrow\textrm{Linear}(\mathbf{V}^{l=0:2})$
$\Delta\mathbf{P}_{v,u}\leftarrow\textrm{Eigendecompose}\big(\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}\big)$
$\mathbf{P}^{*}_{v},\,\mathbf{P}^{*}_{u}\leftarrow\hat{\mathbf{V}}^{l=1}_{\textrm{center}}+\Delta\mathbf{P}_{v,u},\;\hat{\mathbf{V}}^{l=1}_{\textrm{center}}-\Delta\mathbf{P}_{v,u}$
$\mathbf{P}^{\ast}\leftarrow\mathbf{P}^{\ast}\oplus\mathbf{P}^{*}_{v}\oplus\mathbf{P}^{*}_{u}$
return $(\mathbf{P}^{\alpha},\mathbf{P}^{\ast},\bm{\ell})$
Algorithm 2: All-Atom Decoding

This processing makes Ophiuchus strictly blind to ordering flips of permutable atoms, while still enabling it to operate directly on all atoms in an efficient, stacked representation. In Appendix A.1, we illustrate how this approach correctly handles the geometry of side-chain atoms.
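
To make the encoding concrete, the following is a minimal sketch of Algorithm 1 for a single residue in plain JAX (the paper's implementation uses e3nn-jax and Haiku). The helper `sh_l2` follows the explicit $Y_2$ formula of Appendix A.1, while the one-hot label embedding, the function names and the shapes are illustrative assumptions rather than the exact model code.

```python
# Minimal sketch of Algorithm 1 (All-Atom Encoding) for a single residue,
# written in plain jax.numpy; helper names, shapes and the one-hot label
# embedding are illustrative, not the paper's exact implementation.
import jax
import jax.numpy as jnp

def sh_l2(x):
    """Degree l=2 spherical-harmonics projection of a 3D vector (Appendix A.1)."""
    x0, x1, x2 = x
    return jnp.array([
        jnp.sqrt(15.0) * x0 * x2,
        jnp.sqrt(15.0) * x0 * x1,
        jnp.sqrt(5.0) / 2.0 * (-x0**2 + 2.0 * x1**2 - x2**2),
        jnp.sqrt(15.0) * x1 * x2,
        jnp.sqrt(15.0) / 2.0 * (-x0**2 + x2**2),
    ])

def encode_residue(p_alpha, p_rel_ordered, p_rel_pair, residue_index, num_labels=20):
    """Pack one residue into (P, V^{l=0}, V^{l=1}, V^{l=2}).

    p_alpha:        (3,)   C-alpha position, used as the anchor P
    p_rel_ordered:  (n, 3) relative positions of canonically orderable atoms
    p_rel_pair:     (2, 3) relative positions of one unorderable atom pair (v, u)
    """
    V0 = jax.nn.one_hot(residue_index, num_labels)          # scalar (l=0) label embedding
    V1_center = 0.5 * (p_rel_pair[0] + p_rel_pair[1])       # invariant to swapping (v, u)
    V2_diff = sh_l2(0.5 * (p_rel_pair[0] - p_rel_pair[1]))  # also invariant to the swap
    V1 = jnp.vstack([p_rel_ordered, V1_center[None]])       # stacked l=1 features
    return p_alpha, V0, V1, V2_diff
```

Swapping the two rows of `p_rel_pair` leaves the returned features unchanged, which is exactly the two-flip invariance discussed above.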

3.2 Self-Interaction

Our Self-Interaction is designed to model the internal interactions of atoms within each residue. This transformation updates the feature vectors $\mathbf{V}^{0:l_{\max}}_{i}$ centered at the same residue $i$. Importantly, it blends feature vectors $\mathbf{V}^{l}$ of varying degrees $l$ by employing tensor products of the features with themselves. We offer two implementations of these tensor products to cater to different computational needs. Our Self-Interaction module draws inspiration from MACE [Batatia et al. (2022)]. For a comprehensive explanation, please refer to Appendix A.3.

Input: Latent Representation $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
$\mathbf{V}^{0:l_{\max}}\leftarrow\mathbf{V}^{0:l_{\max}}\oplus\left(\mathbf{V}^{0:l_{\max}}\right)^{\otimes 2}$  $\triangleright$ Tensor square and concatenate
$\mathbf{V}^{0:l_{\max}}\leftarrow\mathrm{Linear}(\mathbf{V}^{0:l_{\max}})$  $\triangleright$ Update features
$\mathbf{V}^{0:l_{\max}}\leftarrow\textrm{MLP}(\mathbf{V}^{l=0})\cdot\mathbf{V}^{0:l_{\max}}$  $\triangleright$ Gate activation function
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Algorithm 3: Self-Interaction
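
As a concrete illustration of Algorithm 3, the sketch below implements a channel-wise Self-Interaction restricted to degrees $l\le 1$ (scalars and 3D vectors) in plain JAX. The full model couples degrees up to $l_{\max}$ with Clebsch-Gordan tensor products via e3nn-jax, so this is a simplified, assumed parameterization rather than the model itself.

```python
# Simplified channel-wise Self-Interaction for l=0 scalars and l=1 vectors only.
import jax
import jax.numpy as jnp

def self_interaction(params, V0, V1):
    """V0: (C,) scalars, V1: (C, 3) vectors for a single residue."""
    # Channel-wise "tensor square" terms that stay within l <= 1:
    sq_scalars = jnp.concatenate([V0 * V0, jnp.sum(V1 * V1, axis=-1)])   # l=0 outputs
    sq_vectors = V0[:, None] * V1                                        # l=1 outputs
    # Concatenate with the original features and mix channels per degree;
    # linear layers act within a degree, so equivariance is preserved.
    h0 = jnp.concatenate([V0, sq_scalars]) @ params["w0"]                          # (3C,) -> (C,)
    h1 = jnp.einsum("cd,ci->di", params["w1"], jnp.concatenate([V1, sq_vectors]))  # (2C,3) -> (C,3)
    # Gate activation: invariant scalars modulate the norm of each vector channel.
    gate = jax.nn.sigmoid(h0 @ params["wg"])                             # (C,)
    return jax.nn.silu(h0), gate[:, None] * h1

# Illustrative parameter shapes for C = 16 channels:
C = 16
k0, k1, kg = jax.random.split(jax.random.PRNGKey(0), 3)
params = {
    "w0": jax.random.normal(k0, (3 * C, C)) / jnp.sqrt(3 * C),
    "w1": jax.random.normal(k1, (2 * C, C)) / jnp.sqrt(2 * C),
    "wg": jax.random.normal(kg, (C, C)) / jnp.sqrt(C),
}
V0_new, V1_new = self_interaction(params, jnp.ones(C), jnp.ones((C, 3)))
```

Because the linear maps act within a degree and the gate multiplies each vector channel by an invariant scalar, the output transforms like the input under rotations.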

3.3 Sequence Convolution

To take advantage of the sequential nature of proteins, we propose a one-dimensional, roto-translation equivariant convolutional layer for acting on geometric features and positions of sequence neighbors. Given a kernel window size $K$ and stride $S$, we concatenate representations $\mathbf{V}^{0:l_{\max}}_{i-\frac{K}{2}:i+\frac{K}{2}}$ with the same $l$ value. Additionally, we include normalized relative vectors between anchoring positions $\mathbf{P}_{i-\frac{K}{2}:i+\frac{K}{2}}$. Following conventional CNN architectures, this concatenated representation undergoes a linear transformation. The scalars in the resulting representation are then converted into weights, which are used to combine window coordinates into a new coordinate. To ensure translational equivariance, these weights are constrained to sum to one.

Input: Window of Latent Representations $(\mathbf{P},\mathbf{V}^{0:l_{\max}})_{1:K}$
$w_{1:K}\leftarrow\textrm{Softmax}\Big(\textrm{MLP}\big(\mathbf{V}^{0}_{1:K}\big)\Big)$  $\triangleright$ Such that $\sum_{k}w_{k}=1$
$\mathbf{P}\leftarrow\sum_{k=1}^{K}w_{k}\mathbf{P}_{k}$  $\triangleright$ Coarsen coordinates
$\tilde{\mathbf{V}}\leftarrow\bigoplus_{K}\mathbf{V}^{0:l_{\max}}_{1:K}$  $\triangleright$ Stack features
$\tilde{\mathbf{P}}\leftarrow\bigoplus^{K,K}_{i=1,j=1}Y\left(\mathbf{P}_{i}-\mathbf{P}_{j}\right)$  $\triangleright$ Stack relative vectors
$\mathbf{V}^{0:l_{\max}}\leftarrow\textrm{Linear}\Big(\tilde{\mathbf{V}}\oplus\tilde{\mathbf{P}}\Big)$  $\triangleright$ Coarsen features and vectors
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Algorithm 4: Sequence Convolution

When $S>1$, sequence convolutions reduce the dimensionality along the sequence axis, yielding mixed representations and coarse coordinates. To reverse this procedure, we introduce a transpose convolution algorithm that uses its $\mathbf{V}^{l=1}$ features to spawn coordinates. For further details, please refer to Appendix A.6.
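
A hedged sketch of the coarsening Sequence Convolution (Algorithm 4) is given below, again restricted to $l\le 1$ features and using raw relative anchor vectors in place of the full spherical-harmonics expansion $Y$; the kernel size and stride mirror the description above, but the exact parameterization is an assumption.

```python
# Sketch of a strided, roto-translation equivariant Sequence Convolution with
# kernel size K and stride S, for l=0 scalars and l=1 vectors only.
# Expected parameter shapes (C_out is the coarser channel count):
#   w_pos: (K*C, K), w0: (K*C, C_out), w1: (K*C + K*K, C_out)
import jax
import jax.numpy as jnp

def sequence_conv(params, P, V0, V1, K=5, S=3):
    """P: (N, 3) anchors, V0: (N, C) scalars, V1: (N, C, 3) vectors -> coarser sequence."""
    N, C = V0.shape
    out_P, out_V0, out_V1 = [], [], []
    for i in range(0, N - K + 1, S):
        p, v0, v1 = P[i:i + K], V0[i:i + K], V1[i:i + K]
        # Window scalars -> normalized weights; the coarse anchor is a convex
        # combination of window anchors, which preserves translation equivariance.
        w = jax.nn.softmax(v0.reshape(-1) @ params["w_pos"])            # (K,)
        p_new = w @ p                                                   # (3,)
        # Relative vectors between window anchors join the l=1 features.
        rel = (p[:, None, :] - p[None, :, :]).reshape(-1, 3)            # (K*K, 3)
        v0_new = v0.reshape(-1) @ params["w0"]                          # (C_out,)
        v1_new = jnp.einsum("kd,ki->di", params["w1"],
                            jnp.concatenate([v1.reshape(-1, 3), rel]))  # (C_out, 3)
        out_P.append(p_new)
        out_V0.append(v0_new)
        out_V1.append(v1_new)
    return jnp.stack(out_P), jnp.stack(out_V0), jnp.stack(out_V1)
```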

3.4 Spatial Convolution

To capture interactions of residues that are close in three-dimensional space, we introduce the Spatial Convolution. This operation updates representations and positions through message passing within the k-nearest spatial neighbors. Message representations incorporate SO(3) signals from the vector difference between neighbor coordinates, and we aggregate messages with a permutation-invariant mean. After aggregation, we linearly transform the vector representations into an update for the coordinates.

Input: Latent Representations $(\mathbf{P},\mathbf{V}^{0:l_{\max}})_{1:N}$
Input: Output Node Index $i$
$(\tilde{\mathbf{P}},\tilde{\mathbf{V}}^{0:l_{\max}})_{1:k}\leftarrow k\textrm{-Nearest-Neighbors}(\mathbf{P}_{i},\mathbf{P}_{1:N})$
$R_{1:k},\;\phi_{1:k}\leftarrow\textrm{Embed}(||\tilde{\mathbf{P}}_{1:k}-\mathbf{P}_{i}||_{2}),\;Y(\tilde{\mathbf{P}}_{1:k}-\mathbf{P}_{i})$  $\triangleright$ Edge features
$\tilde{\mathbf{V}}^{0:l_{\max}}_{1:k}\leftarrow\textrm{MLP}(R_{k})\cdot\Big(\textrm{Linear}(\tilde{\mathbf{V}}^{0:l_{\max}}_{1:k})+\textrm{Linear}\left(\phi_{1:k}\right)\Big)$  $\triangleright$ Prepare messages
$\mathbf{V}^{0:l_{\max}}\leftarrow\textrm{Linear}\left(\mathbf{V}^{0:l_{\max}}_{i}+\frac{1}{k}\sum_{k}\tilde{\mathbf{V}}^{0:l_{\max}}_{k}\right)$  $\triangleright$ Aggregate and update
$\mathbf{P}\leftarrow\mathbf{P}_{i}+\textrm{Linear}\left(\mathbf{V}^{l=1}\right)$  $\triangleright$ Update positions
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Algorithm 5: Spatial Convolution
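
The sketch below illustrates Algorithm 5 for a single output node, with $l\le 1$ features, unit edge vectors in place of the full spherical-harmonics edge features, and a single dense layer on the raw distance standing in for the radial-embedding MLP; all parameter names and shapes are illustrative assumptions.

```python
# Sketch of the Spatial Convolution for one output node i (l=0/l=1 features).
# Expected parameter shapes: w_rad (1, C), w_msg0 (C, C), w_msg1 (C, C),
#   w_dir (C,), w_out0 (C, C), w_out1 (C, C), w_pos (C,)
import jax
import jax.numpy as jnp

def spatial_conv(params, i, P, V0, V1, k=16):
    """P: (N, 3), V0: (N, C), V1: (N, C, 3); update node i from its k nearest neighbors."""
    dist_all = jnp.linalg.norm(P - P[i], axis=-1)
    nbr = jnp.argsort(dist_all)[1:k + 1]                      # k nearest neighbors, excluding i
    rel = P[nbr] - P[i]                                       # (k, 3)
    dist = jnp.linalg.norm(rel, axis=-1, keepdims=True)       # (k, 1)
    unit = rel / (dist + 1e-8)                                # l=1 edge directions
    gate = jax.nn.silu(dist @ params["w_rad"])                # (k, C) distance-conditioned gates
    msg0 = gate * (V0[nbr] @ params["w_msg0"])                # (k, C) scalar messages
    msg1 = gate[..., None] * (jnp.einsum("cd,kci->kdi", params["w_msg1"], V1[nbr])
                              + params["w_dir"][None, :, None] * unit[:, None, :])
    V0_new = (V0[i] + msg0.mean(axis=0)) @ params["w_out0"]   # mean-aggregate and update
    V1_new = jnp.einsum("cd,ci->di", params["w_out1"], V1[i] + msg1.mean(axis=0))
    P_new = P[i] + jnp.einsum("c,ci->i", params["w_pos"], V1_new)   # position update
    return P_new, V0_new, V1_new
```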

3.5 Deep Coarsening Autoencoder

We compose Spatial Convolution, Self-Interaction and Sequence Convolution modules to define a coarsening/refining block. The block mixes representations across all relevant axes of the domain while producing coarsened, downsampled positions and mixed embeddings. The reverse result (finer positions and decoupled embeddings) is achieved by changing the standard Sequence Convolution to its transpose counterpart. When employing sequence convolutions of stride $S>1$, we increase the dimensionality of the feature representation according to a rescaling factor hyperparameter $\rho$. We stack $L$ coarsening blocks to build a deep neural encoder $\mathcal{E}$ (Alg. 7), and symmetrically $L$ refining blocks to build a decoder $\mathcal{D}$ (Alg. 8).

3.6 Autoencoder Reconstruction Losses

We use a number of reconstruction losses to ensure good quality of produced proteins.

Vector Map Loss. We train the model by directly comparing internal three-dimensional vector difference maps. Let $V(\mathbf{P})$ denote the internal vector map between all atoms $\mathbf{P}$ in our data, that is, $V(\mathbf{P})^{i,j}=(\mathbf{P}^{i}-\mathbf{P}^{j})\in\mathbb{R}^{3}$. We define the vector map loss as $\mathcal{L}_{\textrm{VectorMap}}=\textrm{HuberLoss}(V(\mathbf{P}),V(\hat{\mathbf{P}}))$ [Huber (1992)]. When computing this loss, an additional stage is employed for processing permutation symmetry breaks. More details can be found in Appendix B.1.
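
A minimal sketch of this loss, with a hand-written Huber penalty (the exact Huber parameters and the permutation-resolution stage of Appendix B.1 are omitted):

```python
# Vector-map reconstruction loss: compare all pairwise difference vectors of
# predicted and ground-truth atom positions under a Huber penalty. The loss is
# translation-invariant since only internal difference vectors are compared.
import jax.numpy as jnp

def vector_map(P):
    """P: (M, 3) atom positions -> (M, M, 3) internal difference vectors."""
    return P[:, None, :] - P[None, :, :]

def huber(x, delta=1.0):
    a = jnp.abs(x)
    return jnp.where(a <= delta, 0.5 * x**2, delta * (a - 0.5 * delta))

def vector_map_loss(P_true, P_pred, delta=1.0):
    return jnp.mean(huber(vector_map(P_pred) - vector_map(P_true), delta))
```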

Residue Label Cross Entropy Loss. We train the model to predict logits $\bm{\ell}$ over the alphabet $\mathcal{R}$ for each residue. We use the cross entropy between predicted logits and ground-truth labels: $\mathcal{L}_{\textrm{CrossEntropy}}=\textrm{CrossEntropy}(\bm{\ell},\mathbf{R})$.

Chemistry Losses. We incorporate $L_{2}$ norm-based losses for comparing bonds, angles and dihedrals between prediction and ground truth. For non-bonded atoms, a clash loss is evaluated using standard van der Waals atomic radii (Appendix B.2).
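
For the clash term, a hedged sketch is shown below: non-bonded atom pairs are penalized when they come closer than the sum of their van der Waals radii. The radii table, the non-bonded mask and the quadratic penalty shape are assumptions on our part; Appendix B.2 gives the exact form.

```python
# Clash penalty sketch: `radii` holds per-atom van der Waals radii and
# `nonbonded_mask` marks atom pairs that are neither bonded nor identical.
import jax.numpy as jnp

def clash_loss(P, radii, nonbonded_mask, tolerance=0.0):
    """P: (M, 3) positions, radii: (M,), nonbonded_mask: (M, M) bool."""
    dist = jnp.linalg.norm(P[:, None, :] - P[None, :, :] + 1e-8, axis=-1)
    min_allowed = radii[:, None] + radii[None, :] - tolerance
    overlap = jnp.maximum(min_allowed - dist, 0.0)          # positive only when clashing
    return jnp.sum(jnp.where(nonbonded_mask, overlap**2, 0.0)) / 2.0
```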

Please refer to Appendix B for further details on the loss.

3.7 Latent Diffusion

We train an SO(3)-equivariant DDPM [Ho et al. (2020); Sohl-Dickstein et al. (2015)] on the latent space of our autoencoder. We pre-train an autoencoder and transform each protein $\mathbf{X}$ from the dataset into a geometric tensor of irreducible representations of SO(3): $\mathbf{Z}=\mathcal{E}(\mathbf{X})$. We attach a diffusion process of $T$ steps to the latent variables, setting $\mathbf{Z}_{0}=\mathbf{Z}$ and $\mathbf{Z}_{T}\sim\mathcal{N}(0,1)$. We follow the parameterization described in [Salimans & Ho (2022)], and train a denoising model to reconstruct the original data $\mathbf{Z}_{0}$ from its noised version $\mathbf{Z}_{t}$:

$$\mathcal{L}_{\textrm{diffusion}}=\mathbb{E}_{\epsilon,t}\big[w(\lambda_{t})\,||\hat{\mathbf{Z}}_{0}(\mathbf{Z}_{t})-\mathbf{Z}_{0}||^{2}_{2}\big]$$

In order to ensure that bottleneck representations $\mathbf{Z}_{0}$ are well-behaved for generation purposes, we regularize the latent space of our autoencoder (Appendix B.3). We build a denoising network with $L_{D}$ layers of Self-Interaction. Please refer to Appendix F for further details.
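
The training objective can be sketched as follows, with the encoder and denoiser as placeholder callables, a cosine noise schedule, and $w(\lambda_t)=1$; the paper follows the parameterization of Salimans & Ho (2022), for which this simplified $\mathbf{Z}_0$-regression stands in.

```python
# Sketch of the latent-diffusion training objective: noise the encoded latent
# Z_0 with a cosine schedule and regress the denoiser output back to Z_0.
# `encode_fn` and `denoise_fn` are placeholder callables, and the schedule and
# the choice w(lambda_t) = 1 are simplifying assumptions.
import jax
import jax.numpy as jnp

def diffusion_loss(params, key, X, encode_fn, denoise_fn, T=1000):
    Z0 = encode_fn(X)                                    # latent irreps, flattened to an array
    t = jax.random.randint(key, (), 1, T + 1)            # uniform timestep
    alpha_bar = jnp.cos(0.5 * jnp.pi * t / T) ** 2       # cosine noise schedule in (0, 1]
    eps = jax.random.normal(jax.random.fold_in(key, 1), Z0.shape)
    Zt = jnp.sqrt(alpha_bar) * Z0 + jnp.sqrt(1.0 - alpha_bar) * eps
    Z0_hat = denoise_fn(params, Zt, t / T)
    return jnp.mean((Z0_hat - Z0) ** 2)
```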

3.8 Implementation

We train all models on single-GPU A6000 and GTX-1080 Ti machines. We implement Ophiuchus in JAX using the Python libraries e3nn-jax [Geiger & Smidt (2022)] and Haiku [Hennigan et al. (2020)].

4 Methods and Results

4.1 Autoencoder Architecture Comparison

We compare Ophiuchus to the architecture proposed in [Fu et al. (2023)], which uses the EGNN-based architecture [Satorras et al. (2022)] for autoencoding protein backbones. To the best of our knowledge, this is the only other model that attempts protein reconstruction in three dimensions with roto-translation equivariant networks. For demonstration purposes, we curate a small dataset of protein $\mathbf{C}_{\alpha}$-backbones from the PDB with lengths between 16 and 64 and resolution of at most 1.5 Å, resulting in 1267 proteins. We split the data into train, validation and test sets with ratio [0.8, 0.1, 0.1]. In Table 1, we report the test performance at the best validation step, while avoiding over-fitting during training.

Table 1: Reconstruction from single feature bottleneck

| Model | Downsampling Factor | Channels/Layer | # Params [1e6] | C$_\alpha$-RMSD (Å) ↓ | Residue Acc. (%) ↑ |
|---|---|---|---|---|---|
| EGNN | 2 | [32, 32] | 0.68 | 1.01 | 88 |
| EGNN | 4 | [32, 32, 48] | 1.54 | 1.12 | 80 |
| EGNN | 8 | [32, 32, 48, 72] | 3.30 | 2.06 | 73 |
| EGNN | 16 | [32, 32, 48, 72, 108] | 6.99 | 11.4 | 25 |
| Ophiuchus | 2 | [5, 7] | 0.018 | 0.11 | 98 |
| Ophiuchus | 4 | [5, 7, 10] | 0.026 | 0.14 | 97 |
| Ophiuchus | 8 | [5, 7, 10, 15] | 0.049 | 0.36 | 79 |
| Ophiuchus | 16 | [5, 7, 10, 15, 22] | 0.068 | 0.43 | 37 |

We find that Ophiuchus vastly outperforms the EGNN-based architecture. Ophiuchus recovers the protein sequence and $\mathbf{C}_{\alpha}$ backbone with significantly better accuracy, while using orders of magnitude fewer parameters. Refer to Appendix C for further details.

4.2 Architecture Ablation

To investigate the effects of different architecture layouts and coarsening rates, we train different instantiations of Ophiuchus to coarsen representations of contiguous 160-residue protein fragments from the Protein Data Bank (PDB) [Berman et al. (2000)]. We filter out entries with resolution coarser than 2.5 Å or total sequence length larger than 512, and ensure that proteins in the dataset have the same chirality. For ablations, the sequence convolution uses a kernel size of 5 and stride of 3, the channel rescale factor per layer is one of {1.5, 1.7, 2.0}, and the number of downsampling layers ranges from 3 to 5. The initial residue representation uses 16 channels, where each channel is composed of one scalar ($l=0$) and one 3D vector ($l=1$). All experiments were repeated 3 times with different initialization seeds and data shuffles.

Table 2: Ophiuchus Recovery from Compressed Representations - All-Atom

| Downsampling Factor | Channels/Layer | # Params [1e6] | C$_\alpha$-RMSD (Å) ↓ | All-Atom RMSD (Å) ↓ | GDT-TS ↑ | GDT-HA ↑ | Residue Acc. (%) ↑ |
|---|---|---|---|---|---|---|---|
| 17 | [16, 24, 36] | 0.34 | 0.90 ± 0.20 | 0.68 ± 0.08 | 94 ± 3 | 76 ± 4 | 97 ± 2 |
| 17 | [16, 27, 45] | 0.38 | 0.89 ± 0.21 | 0.70 ± 0.09 | 94 ± 3 | 77 ± 5 | 98 ± 1 |
| 17 | [16, 32, 64] | 0.49 | 1.02 ± 0.25 | 0.72 ± 0.09 | 92 ± 4 | 73 ± 5 | 98 ± 1 |
| 53 | [16, 24, 36, 54] | 0.49 | 1.03 ± 0.18 | 0.83 ± 0.10 | 91 ± 3 | 72 ± 5 | 60 ± 4 |
| 53 | [16, 27, 45, 76] | 0.67 | 0.92 ± 0.19 | 0.77 ± 0.09 | 93 ± 3 | 75 ± 5 | 66 ± 4 |
| 53 | [16, 32, 64, 128] | 1.26 | 1.25 ± 0.32 | 0.80 ± 0.16 | 86 ± 5 | 65 ± 6 | 67 ± 5 |
| 160 | [16, 24, 36, 54, 81] | 0.77 | 1.67 ± 0.24 | 1.81 ± 0.16 | 77 ± 4 | 54 ± 5 | 17 ± 3 |
| 160 | [16, 27, 45, 76, 129] | 1.34 | 1.39 ± 0.23 | 1.51 ± 0.17 | 83 ± 4 | 60 ± 5 | 17 ± 3 |
| 160 | [16, 32, 64, 128, 256] | 3.77 | 1.21 ± 0.25 | 1.03 ± 0.15 | 87 ± 5 | 65 ± 6 | 27 ± 4 |

In our experiments, we find a trade-off between the domain coarsening factor and reconstruction performance. In Table 2 we show that although residue recovery suffers from large downsampling factors, structure recovery rates remain comparable across various settings. Moreover, we find that we are able to model all atoms in proteins, as opposed to only $\mathbf{C}_{\alpha}$ atoms (as commonly done), and still recover the structure with high precision. These results demonstrate that protein data can be captured effectively and efficiently using sequence-modular geometric representations. We directly utilize the learned compact latent space in the examples shown below. For further ablation analysis, please refer to Appendix D.

Figure 3: Protein Latent Diffusion. (a) We attach a diffusion process to Ophiuchus representations and learn a denoising network to sample embeddings of SO(3) that decode to protein structures. (b) Random samples from the 485-length backbone model.

4.3 Latent Conformational Interpolation

To demonstrate the power of Ophiuchus's geometric latent space, we show smooth interpolation between two states of a protein structure without explicit latent regularization (as opposed to [Ramaswamy et al. (2021)]). We use the PDBFlex dataset [Hrabe et al. (2016)] and pick pairs of flexible proteins. Conformational snapshots of these proteins are used as the endpoints for the interpolation. We train a large Ophiuchus reconstruction model on general PDB data. The model coarsens proteins of up to 485 residues into a single high-order geometric representation using 6 convolutional downsampling layers, each with kernel size 5 and stride 3. The endpoint structures are compressed into single geometric representations, which enables direct latent interpolation in feature space.

We compare the results of linear interpolation in the latent space against linear interpolation in the coordinate domain (Fig. 11). To determine the chemical validity of intermediate states, we scan protein data to quantify average bond lengths and inter-bond angles. We calculate the $L_{2}$ deviation from these averages for bonds and angles of interpolated structures. Additionally, we measure atomic clashes by counting collisions of van der Waals radii of non-bonded atoms (Fig. 11). Although the latent and autoencoder-reconstructed interpolations perform worse than direct interpolation near the original data points, we find that only the latent interpolation maintains a consistent profile of chemical validity throughout the trajectory, while direct interpolation in the coordinate domain disrupts it significantly. This demonstrates that the learned latent space compactly and smoothly represents protein conformations.

4.4 Latent Diffusion Experiments and Benchmarks

Our ablation study (Tables 2 and 4) shows successful recovery of the backbone structure of large proteins even at large coarsening rates. However, we find that sequence reconstruction requires larger models and longer training times. During inference, all-atom models rely on the decoded sequence, making all-atom generation significantly harder to resolve. Due to computational constraints, we investigate all-atom latent diffusion models for short sequence lengths, and focus on backbone models for large proteins. We train the all-atom models on mini-proteins with sequences shorter than 64 residues, leveraging the MiniProtein scaffold dataset produced by [Cao et al. (2022)]. In this regime, our model is precise and successfully reconstructs sequence and all-atom positions. We also instantiate an Ophiuchus model for generating the backbone trace of large proteins of length 485. For that, we train our model on the PDB data curated by [Yim et al. (2023)]. We compare the quality of diffusion samples from our model to RFDiffusion (T=50) and FrameDiff (N=500 and noise=0.1) samples of similar lengths. We generated 500 unconditional samples from each model for evaluation.

In Table 3 we compare sampling the latent space of Ophiuchus to existing models. We find that our model performs comparably in terms of different generated sample metrics, while enabling orders of magnitude faster sampling for proteins. For all comparisons we run all models on a single RTX6000 GPU. Please refer to Appendix F for more details.

Table 3: Comparison to different diffusion models.

| Model | Dataset | Sampling Time (s) ↓ | scRMSD (< 2 Å) ↑ | scTM (> 0.5) ↑ | Diversity ↑ |
|---|---|---|---|---|---|
| FrameDiff [Yim et al. (2023)] | PDB | 8.6 | 0.17 | 0.81 | 0.42 |
| RFDiffusion [Trippe et al. (2023)] | PDB + AlphaFold DB | 50 | 0.79 | 0.99 | 0.64 |
| Ophiuchus-64 All-Atom | MiniProtein | 0.15 | 0.32 | 0.56 | 0.72 |
| Ophiuchus-485 Backbone | PDB | 0.46 | 0.18 | 0.36 | 0.39 |

5 Conclusion and Future Work

In this work, we introduced a new autoencoder model for protein structure and sequence representation. Through extensive ablation of its architecture, we quantified the trade-offs between model complexity and representation quality. We demonstrated the power of our learned representations in latent interpolation, and investigated their use as a basis for efficient latent generation of backbone and all-atom protein structures. Our studies suggest that Ophiuchus provides a strong foundation for constructing state-of-the-art protein neural architectures. In future work, we will investigate scaling Ophiuchus representations and generation to larger proteins and additional molecular domains.

References

  • Anand & Achim (2022) Namrata Anand and Tudor Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models, 2022.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
  • Baek et al. (2021a) Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, Claudia Millán, Hahnbeom Park, Carson Adams, Caleb R. Glassman, Andy DeGiovanni, Jose H. Pereira, Andria V. Rodrigues, Alberdina A. van Dijk, Ana C. Ebrecht, Diederik J. Opperman, Theo Sagmeister, Christoph Buhlheller, Tea Pavkov-Keller, Manoj K. Rathinaswamy, Udit Dalwadi, Calvin K. Yip, John E. Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. Adams, Randy J. Read, and David Baker. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, August 2021a. doi: 10.1126/science.abj8754. URL https://doi.org/10.1126/science.abj8754.
  • Baek et al. (2021b) Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021b.
  • Batatia et al. (2022) Ilyes Batatia, David Peter Kovacs, Gregor N. C. Simm, Christoph Ortner, and Gabor Csanyi. MACE: Higher order equivariant message passing neural networks for fast and accurate force fields. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=YPpSngE-ZU.
  • Bentley (1975) Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
  • Berman et al. (2000) Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
  • Bronstein et al. (2021) Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Velickovic. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. CoRR, abs/2104.13478, 2021. URL https://arxiv.org/abs/2104.13478.
  • Bystroff & Baker (1998) Christopher Bystroff and David Baker. Prediction of local structure in proteins using a library of sequence-structure motifs. Journal of molecular biology, 281(3):565–577, 1998.
  • Cao et al. (2022) Longxing Cao, Brian Coventry, Inna Goreshnik, Buwei Huang, William Sheffler, Joon Sung Park, Kevin M Jude, Iva Marković, Rameshwar U Kadam, Koen HG Verschueren, et al. Design of protein-binding proteins from the target structure alone. Nature, 605(7910):551–560, 2022.
  • Deng et al. (2021) Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas Guibas. Vector neurons: A general framework for so(3)-equivariant networks, 2021.
  • Eguchi et al. (2020) Raphael R. Eguchi, Christian A. Choe, and Po-Ssu Huang. Ig-VAE: Generative modeling of protein structure by direct 3d coordinate generation. PLOS Computational Biology, August 2020. doi: 10.1101/2020.08.07.242347. URL https://doi.org/10.1101/2020.08.07.242347.
  • Elfwing et al. (2017) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, 2017.
  • Elnaggar et al. (2023) Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Charlotte Rochereau, and Burkhard Rost. Ankh: Optimized protein language model unlocks general-purpose modelling, 2023.
  • Fan et al. (2022) Hehe Fan, Zhangyang Wang, Yi Yang, and Mohan Kankanhalli. Continuous-discrete convolution for geometry-sequence modeling in proteins. In The Eleventh International Conference on Learning Representations, 2022.
  • Fu et al. (2023) Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, and Shuiwang Ji. A latent diffusion model for protein structure generation, 2023.
  • Fuchs et al. (2020) Fabian B. Fuchs, Daniel E. Worrall, Volker Fischer, and Max Welling. Se(3)-transformers: 3d roto-translation equivariant attention networks, 2020.
  • Geiger & Smidt (2022) Mario Geiger and Tess Smidt. e3nn: Euclidean neural networks, 2022.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
  • Hennigan et al. (2020) Tom Hennigan, Trevor Cai, Tamara Norman, Lena Martens, and Igor Babuschkin. Haiku: Sonnet for JAX. 2020. URL http://github.com/deepmind/dm-haiku.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
  • Hou et al. (2018) Jie Hou, Badri Adhikari, and Jianlin Cheng. Deepsf: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 34(8):1295–1303, 2018.
  • Hrabe et al. (2016) Thomas Hrabe, Zhanwen Li, Mayya Sedova, Piotr Rotkiewicz, Lukasz Jaroszewski, and Adam Godzik. Pdbflex: exploring flexibility in protein structures. Nucleic acids research, 44(D1):D423–D428, 2016.
  • Huber (1992) Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pp.  492–518. Springer, 1992.
  • Ingraham et al. (2022) John Ingraham, Max Baranov, Zak Costello, Vincent Frappier, Ahmed Ismail, Shan Tie, Wujie Wang, Vincent Xue, Fritz Obermeyer, Andrew Beam, and Gevorg Grigoryan. Illuminating protein space with a programmable generative model. biorxiv, December 2022. doi: 10.1101/2022.12.01.518682. URL https://doi.org/10.1101/2022.12.01.518682.
  • Jing et al. (2021) Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael J. L. Townshend, and Ron Dror. Learning from protein structure with geometric vector perceptrons, 2021.
  • Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • Karydis (2017) Thrasyvoulos Karydis. Learning hierarchical motif embeddings for protein engineering. PhD thesis, Massachusetts Institute of Technology, 2017.
  • King & Koes (2021) Jonathan Edward King and David Ryan Koes. Sidechainnet: An all-atom protein structure dataset for machine learning. Proteins: Structure, Function, and Bioinformatics, 89(11):1489–1496, 2021.
  • Kmiecik et al. (2016) Sebastian Kmiecik, Dominik Gront, Michal Kolinski, Lukasz Wieteska, Aleksandra Elzbieta Dawid, and Andrzej Kolinski. Coarse-grained protein models and their applications. Chemical reviews, 116(14):7898–7936, 2016.
  • Lee et al. (2019) Ingoo Lee, Jongsoo Keum, and Hojung Nam. Deepconv-dti: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS computational biology, 15(6):e1007129, 2019.
  • Li et al. (2022) Alex J. Li, Vikram Sundar, Gevorg Grigoryan, and Amy E. Keating. Terminator: A neural framework for structure-based protein design using tertiary repeating motifs, 2022.
  • Liao & Smidt (2023) Yi-Lun Liao and Tess Smidt. Equiformer: Equivariant graph attention transformer for 3d atomistic graphs, 2023.
  • Lin & AlQuraishi (2023) Yeqing Lin and Mohammed AlQuraishi. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds, 2023.
  • Lin et al. (2021) Zeming Lin, Tom Sercu, Yann LeCun, and Alexander Rives. Deep generative models create new and diverse protein structures. In Machine Learning for Structural Biology Workshop, NeurIPS, 2021.
  • Lin et al. (2023) Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
  • Liu et al. (2022) David D Liu, Ligia Melo, Allan Costa, Martin Vögele, Raphael JL Townshend, and Ron O Dror. Euclidean transformers for macromolecular structures: Lessons learned. 2022 ICML Workshop on Computational Biology, 2022.
  • Mackenzie & Grigoryan (2017) Craig O Mackenzie and Gevorg Grigoryan. Protein structural motifs in prediction and design. Current opinion in structural biology, 44:161–167, 2017.
  • Mackenzie et al. (2016) Craig O. Mackenzie, Jianfu Zhou, and Gevorg Grigoryan. Tertiary alphabet for the observable protein structural universe. Proceedings of the National Academy of Sciences, 113(47), November 2016. doi: 10.1073/pnas.1607178113. URL https://doi.org/10.1073/pnas.1607178113.
  • Mansoor et al. (2023) Sanaa Mansoor, Minkyung Baek, Hahnbeom Park, Gyu Rie Lee, and David Baker. Protein ensemble generation through variational autoencoder latent space sampling. bioRxiv, pp.  2023–08, 2023.
  • Miller et al. (2020) Benjamin Kurt Miller, Mario Geiger, Tess E. Smidt, and Frank Noé. Relevance of rotationally equivariant convolutions for predicting molecular properties. CoRR, abs/2008.08461, 2020. URL https://arxiv.org/abs/2008.08461.
  • Ramaswamy et al. (2021) Venkata K Ramaswamy, Samuel C Musson, Chris G Willcocks, and Matteo T Degiacomi. Deep learning protein conformational space with convolutions and latent interpolations. Physical Review X, 11(1):011052, 2021.
  • Salimans & Ho (2022) Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models, 2022.
  • Satorras et al. (2022) Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, 2022.
  • Smidt (2021) Tess E Smidt. Euclidean symmetry and equivariance in machine learning. Trends in Chemistry, 3(2):82–85, 2021.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.
  • Swanson et al. (2022) Sebastian Swanson, Venkatesh Sivaraman, Gevorg Grigoryan, and Amy E Keating. Tertiary motifs as building blocks for the design of protein-binding peptides. Protein Science, 31(6):e4322, 2022.
  • Thomas et al. (2018) Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds, 2018.
  • Townshend et al. (2022) Raphael J. L. Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon Anderson, Stephan Eismann, Risi Kondor, Russ B. Altman, and Ron O. Dror. Atom3d: Tasks on molecules in three dimensions, 2022.
  • Trippe et al. (2023) Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem, 2023.
  • Vallat et al. (2015) Brinda Vallat, Carlos Madrid-Aliste, and Andras Fiser. Modularity of protein folds as a tool for template-free modeling of structures. PLoS computational biology, 11(8):e1004419, 2015.
  • Visani et al. (2023) Gian Marco Visani, Michael N. Pun, Arman Angaji, and Armita Nourmohammad. Holographic-(v)ae: an end-to-end so(3)-equivariant (variational) autoencoder in fourier space, 2023.
  • Wang & Gómez-Bombarelli (2019) Wujie Wang and Rafael Gómez-Bombarelli. Coarse-graining auto-encoders for molecular dynamics. npj Computational Materials, 5(1):125, 2019.
  • Wang et al. (2022) Wujie Wang, Minkai Xu, Chen Cai, Benjamin Kurt Miller, Tess Smidt, Yusu Wang, Jian Tang, and Rafael Gómez-Bombarelli. Generative coarse-graining of molecular conformations, 2022.
  • Watson et al. (2022) Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek, and David Baker. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv, 2022. doi: 10.1101/2022.12.09.519842. URL https://www.biorxiv.org/content/early/2022/12/10/2022.12.09.519842.
  • Watson et al. (2023) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, pp.  1–3, 2023.
  • Wehmeyer & Noé (2018) Christoph Wehmeyer and Frank Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. The Journal of Chemical Physics, 148(24), mar 2018. doi: 10.1063/1.5011399. URL https://doi.org/10.1063%2F1.5011399.
  • Weiler et al. (2018) Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data, 2018.
  • Wieder et al. (2020) Oliver Wieder, Stefan Kohlbacher, Mélaine Kuenemann, Arthur Garon, Pierre Ducrot, Thomas Seidel, and Thierry Langer. A compact review of molecular property prediction with graph neural networks. Drug Discovery Today: Technologies, 37:1–12, 2020.
  • Winter et al. (2021) Robin Winter, Frank Noé, and Djork-Arné Clevert. Auto-encoding molecular conformations, 2021.
  • Wu et al. (2022a) Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, James Y. Zou, Alex X. Lu, and Ava P. Amini. Protein structure generation via folding diffusion, 2022a.
  • Wu et al. (2022b) Ruidong Wu, Fan Ding, Rui Wang, Rui Shen, Xiwen Zhang, Shitong Luo, Chenpeng Su, Zuofan Wu, Qi Xie, Bonnie Berger, et al. High-resolution de novo structure prediction from primary sequence. BioRxiv, pp.  2022–07, 2022b.
  • Yang et al. (2022) Kevin K Yang, Nicolo Fusi, and Alex X Lu. Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv, pp.  2022–05, 2022.
  • Yang & Gómez-Bombarelli (2023) Soojung Yang and Rafael Gómez-Bombarelli. Chemically transferable generative backmapping of coarse-grained proteins, 2023.
  • Yim et al. (2023) Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. Se(3) diffusion model with application to protein backbone generation, 2023.
  • Zhang et al. (2021) Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. Graph neural networks and their current applications in bioinformatics. Frontiers in genetics, 12:690049, 2021.
  • Zhang et al. (2022) Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, and Jian Tang. Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125, 2022.

Appendix

Appendix A Architecture Details

A.1 All-Atom Representation

A canonical ordering of the atoms of each residue enables the local geometry to be described in a stacked array representation, where each feature channel corresponds to an atom. To directly encode positions, we stack the 3D coordinates of each atom. The coordinates vector behaves as the irreducible representation of SO(3) of degree $l=1$. The atomic coordinates are taken relative to the $\mathbf{C}_{\alpha}$ of each residue. In practice, for implementing this ordering we follow the atom14 tensor format of SidechainNet [King & Koes (2021)], where a vector $P\in\mathbb{R}^{14\times 3}$ contains the atomic positions per residue. In Ophiuchus, we rearrange this encoding: one of those dimensions, the $\mathbf{C}_{\alpha}$ coordinate, is used as the absolute position; the 13 remaining 3D vectors are centered at the $\mathbf{C}_{\alpha}$ and used as geometric features. The geometric features of residues with fewer than 14 atoms are zero-padded (Figure 4).
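
A small sketch of this rearrangement under the atom14 convention is given below; the $\mathbf{C}_{\alpha}$ slot index and the mask handling are assumptions on our part rather than the exact data pipeline.

```python
# Split a per-residue atom14 position tensor into the C-alpha anchor and 13
# zero-padded relative position vectors. `atom_mask` marks which of the 14
# slots exist for the residue type; ca_index=1 assumes the usual atom14
# ordering (N, CA, C, O, CB, ...).
import jax.numpy as jnp

def split_atom14(atom14_pos, atom_mask, ca_index=1):
    """atom14_pos: (14, 3), atom_mask: (14,) -> anchor (3,), features (13, 3)."""
    p_alpha = atom14_pos[ca_index]
    rel = (atom14_pos - p_alpha) * atom_mask[:, None]            # relative to C-alpha, zero-padded
    rel = jnp.concatenate([rel[:ca_index], rel[ca_index + 1:]])  # drop the C-alpha slot
    return p_alpha, rel
```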

Still, four of the standard residues (Aspartic Acid, Glutamic Acid, Phenylalanine and Tyrosine) have at most two pairs of atoms that are interchangeable, due to the presence of $180^{\circ}$-rotation symmetries [Jumper et al. (2021)]. In Figure 5, we show how stacking their relative positions leads to representations that differ even when the same structure occurs across different rotations of a side-chain. To solve this issue, instead of stacking two 3D vectors $(\mathbf{P}^{*}_{u},\mathbf{P}^{*}_{v})$, our method uses a single $l=1$ vector $\mathbf{V}^{l=1}_{\textrm{center}}=\frac{1}{2}(\mathbf{P}^{*}_{v}+\mathbf{P}^{*}_{u})$. The mean makes this feature invariant to the $(v,u)$ atomic permutations, and the resulting vector points to the midpoint between the two atoms. To fully describe the positioning, the difference $(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})$ must be encoded as well. For that, we use a single $l=2$ feature $\mathbf{V}^{l=2}_{\textrm{diff}}=Y_{2}\big(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\big)$. This feature is produced by projecting the difference of positions onto a degree $l=2$ spherical harmonics basis. Let $x\in\mathbb{R}^{3}$ denote a 3D vector. Then its projection onto a feature of degree $l=2$ is defined as:

$$Y_{2}(x)=\left[\sqrt{15}\,x_{0}x_{2},\;\sqrt{15}\,x_{0}x_{1},\;\frac{\sqrt{5}}{2}\left(-x_{0}^{2}+2x_{1}^{2}-x_{2}^{2}\right),\;\sqrt{15}\,x_{1}x_{2},\;\frac{\sqrt{15}}{2}\left(-x_{0}^{2}+x_{2}^{2}\right)\right]$$

where each dimension of the resulting term is indexed by the order $m\in[-l,l]=[-2,2]$, for degree $l=2$. We note that for $m\in\{-2,-1,1\}$, two distinct components of $x$ directly multiply, while for $m\in\{0,2\}$ only squared terms of $x$ are present. In both cases, the terms are invariant to flipping the sign of $x$, such that $Y_{2}(x)=Y_{2}(-x)$. Equivalently, $\mathbf{V}^{l=2}_{\textrm{diff}}$ is invariant to reordering of the two atoms, $(v,u)\rightarrow(u,v)$:

$$\mathbf{V}^{l=2}_{\textrm{diff}}=Y_{2}\left(\frac{1}{2}(\mathbf{P}^{*}_{v}-\mathbf{P}^{*}_{u})\right)=Y_{2}\left(\frac{1}{2}(\mathbf{P}^{*}_{u}-\mathbf{P}^{*}_{v})\right)$$
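
This invariance is easy to verify numerically; the snippet below evaluates the explicit $Y_2$ projection above at $x$ and $-x$ (a small self-contained check, not part of the model code):

```python
# Numerical check of the sign invariance Y_2(x) = Y_2(-x).
import jax
import jax.numpy as jnp

def Y2(x):
    x0, x1, x2 = x
    return jnp.array([
        jnp.sqrt(15.0) * x0 * x2,
        jnp.sqrt(15.0) * x0 * x1,
        jnp.sqrt(5.0) / 2.0 * (-x0**2 + 2.0 * x1**2 - x2**2),
        jnp.sqrt(15.0) * x1 * x2,
        jnp.sqrt(15.0) / 2.0 * (-x0**2 + x2**2),
    ])

x = jax.random.normal(jax.random.PRNGKey(0), (3,))
assert jnp.allclose(Y2(x), Y2(-x))   # invariant to flipping the atom pair
```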

In Figure 5, we compare the geometric latent space of a network that uses this permutation-invariant encoding against one that uses naive stacking of atomic positions $\mathbf{P}^{*}$. We find that Ophiuchus correctly maps the geometry of the data, while direct stacking leads to representations that do not reflect the underlying symmetries.

Figure 4: Geometric Representations of Protein Residues. (a) We leverage the bases of spherical harmonics to represent signals on SO(3). (b-d) Examples of individual protein residues encoded in spherical harmonics bases: the residue label is directly encoded in an order $l=0$ representation as a one-hot vector; atom positions in a canonical ordering are encoded as $l=1$ features; additionally, unorderable atom positions are encoded as $l=1$ and $l=2$ features that are invariant to their permutation flips. In the figures, we displace $l=2$ features for illustrative purposes; in practice, the signal is processed as centered at the $\mathbf{C}_{\alpha}$. (b) Tryptophan is the largest residue we consider, utilizing all dimensions of $n_{\max}=13$ atoms in the input $\mathbf{V}^{l=1}_{\textrm{ordered}}$. (c) Tyrosine needs padding for $\mathbf{V}^{l=1}$, but produces two $\mathbf{V}^{l=2}$ from its two pairs of permutable atoms. (d) Aspartic Acid has a single pair of permutable atoms, and its $\mathbf{V}^{l=2}$ is padded.
Figure 5: Permutation Invariance of Side-chain Atoms in Stacked Geometric Representations. (a) We provide an example in which we rotate the side-chain of Tyrosine by $2\pi$ radians while keeping the ordering of atoms fixed. Note that the structure is the same after rotation by $\pi$ radians. (b) An SO(3)-equivariant network may stack atomic relative positions in a canonical order. However, because of the permutation symmetries, the naive stacked representation leads to latent representations that differ even when the data is geometrically the same. To demonstrate this, we plot two components ($m=-1$ and $m=0$) of an internal $\mathbf{V}^{l=1}$ feature while rotating the side-chain positions by $2\pi$ radians. This network represents the structures rotated by $\theta=0$ (red) and $\theta=\pi$ (cyan) differently despite them being geometrically identical. (c-d) Plots of internal $\mathbf{V}^{l=1,2}$ features of Ophiuchus, which encodes the position of permutable pairs jointly as an $l=1$ center of symmetry and an $l=2$ difference of positions, resulting in the same representation for structures rotated by $\theta=0$ (red) and $\theta=\pi$ (cyan).

A.2 All-Atom Decoding

Given a latent representation at the residue level, $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$, we take $\mathbf{P}$ directly as the $\mathbf{C}_{\alpha}$ position of the residue. The hidden scalar representations $\mathbf{V}^{l=0}$ are transformed into categorical logits $\bm{\ell}\in\mathbb{R}^{|\mathcal{R}|}$ to predict the probabilities of the residue label $\hat{\mathbf{R}}$. To decode the side-chain atoms, Ophiuchus produces positions relative to $\mathbf{C}_{\alpha}$ for all atoms of each residue. During training, we enforce the residue label to be the ground truth. During inference, we output the residue label corresponding to the largest logit value.

Relative coordinates are produced directly from geometric features. We linearly project $\mathbf{V}^{l=1}$ to obtain the relative position vectors of orderable atoms $\hat{\mathbf{V}}^{l=1}_{\textrm{ordered}}$. To decode positions $(\hat{\mathbf{P}}^{*}_{v},\hat{\mathbf{P}}^{*}_{u})$ for an unorderable pair of atoms $(v,u)$, we linearly project $\mathbf{V}^{l>0}$ to predict $\hat{\mathbf{V}}^{l=1}_{\textrm{center}}$ and $\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$. To produce two relative positions out of $\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$, we determine the rotation axis around which the feature rotates the least by taking the left-eigenvector with the smallest eigenvalue of $\mathcal{X}^{l=2}\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$, where $\mathcal{X}^{l=2}$ is the generator of the irreducible representations of degree $l=2$. We illustrate this procedure in Figure 6 and explain the process in detail in the caption. This method proves effective because the direction of this axis is inherently ambiguous, matching our requirement that the two vectors be unorderable.
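The following sketch illustrates this decoding step, assuming e3nn exposes the degree-2 generators through o3.so3_generators (the function decode_unordered_pair and its details are ours, not the released implementation): the action of the three generators on the predicted feature forms a 3×5 matrix whose left-singular vector of smallest singular value gives the axis, up to an arbitrary sign.

import torch
from e3nn import o3

def decode_unordered_pair(v_center, v_diff_l2):
    # v_center: (3,) predicted l=1 midpoint relative to the C-alpha.
    # v_diff_l2: (5,) predicted l=2 "double-sided arrow" feature.
    X = o3.so3_generators(2)                          # (3, 5, 5), assumed e3nn API
    M = torch.einsum('aij,j->ai', X, v_diff_l2.to(X.dtype))
    # Axis whose rotation perturbs the feature least = left-singular vector
    # with the smallest singular value; its sign is arbitrary (unorderable pair).
    U, S, _ = torch.linalg.svd(M)
    delta = U[:, -1]                                  # unit relative direction; the overall
                                                      # scale depends on the SH normalization
    return v_center + delta, v_center - delta         # (P_v, P_u), defined up to a swap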

Figure 6: Decoding Permutable Atoms. Sketch of how to decode a double-sided arrow ($\mathbf{V}^{l=2}$ signal) into two unordered vectors ($\hat{\mathbf{P}}^{*}_{u}$, $\hat{\mathbf{P}}^{*}_{v}$). $Y^{l=2}:\mathbb{R}^{3}\to\mathbb{R}^{5}$ can be viewed as a 3-dimensional manifold embedded in a 5-dimensional space. We exploit the fact that the points on that manifold are unaffected by a rotation around their pre-image vector. To extend the definition to all points of $\mathbb{R}^{5}$ (i.e., also outside of the manifold), we look for the axis of rotation with the smallest impact on the 5D entry. For that, we compute the left-eigenvector with the smallest eigenvalue of the action of the generators on the input point: $\mathcal{X}^{l=2}\hat{\mathbf{V}}^{l=2}_{\textrm{diff}}$. The resulting axis is used as a relative position $\pm\Delta\mathbf{P}^{*}_{v,u}$ between the two atoms, and atomic positions are recovered through $\mathbf{P}^{*}_{v}=\mathbf{V}^{l=1}_{\textrm{center}}+\Delta\mathbf{P}^{*}_{v,u}$ and $\mathbf{P}^{*}_{u}=\mathbf{V}^{l=1}_{\textrm{center}}-\Delta\mathbf{P}^{*}_{v,u}$.
Figure 7: Building Blocks of Self-Interaction. (a) Self-Interaction updates only the SO(3)-equivariant features, which are represented as $D$ channels, each carrying vectors of degree $l$ up to $l_{\textrm{max}}$. (b) A roto-translation equivariant linear layer transforms only within the same order $l$. (c-d) We use the tensor square operation to interact features across different degrees $l$, and employ two instantiations of this operation. (c) The Self-Interaction in autoencoder models applies the square operation within the same channel of the representation, interacting features with themselves across $l$ of the same channel dimension $d\in[0,D]$. (d) The Self-Interaction in diffusion models chunks the representation into groups of channels before the square operation. It is more expressive, but incurs a higher computational cost.

A.3 Details of Self-Interaction

The objective of our Self-Interaction module is to operate exclusively on the geometric features $\mathbf{V}^{0:l_{\max}}$ while mixing irreducible representations across different $l$ values. To accomplish this, we calculate the tensor product of the representation with itself; this operation is termed the "tensor square" and is denoted $\left(\mathbf{V}^{0:l_{\max}}\right)^{\otimes 2}$. As the channel dimension expands, the computational cost of the full tensor square grows quadratically. To reduce this cost, we instead perform the square operation channel-wise, or by chunks of channels. Figures 7.c and 7.d illustrate these operations. After obtaining the squares from the chunked or individual channels, the segmented results are concatenated to generate an updated representation $\mathbf{V}^{0:l_{\max}}$, which is transformed through a learnable linear layer to the output dimensionality.
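As an illustration, here is a minimal sketch of a channel-wise tensor square followed by a learnable linear projection, assuming e3nn and PyTorch (the construction is ours, not the released code). The 'uuu' instructions pair channel u of one degree with channel u of another, so degrees mix only within the same channel and the cost stays linear in the number of channels D:

import torch
from e3nn import o3

def channelwise_square(irreps):
    # Build a tensor product that only couples degrees within the same channel.
    irreps = o3.Irreps(irreps)
    out, instructions = [], []
    for i1, (m1, ir1) in enumerate(irreps):
        for i2, (m2, ir2) in enumerate(irreps):
            if i2 < i1:
                continue                      # the square is symmetric; keep each pair once
            assert m1 == m2                   # same multiplicity D for every degree
            for ir_out in ir1 * ir2:
                instructions.append((i1, i2, len(out), "uuu", False))
                out.append((m1, ir_out))
    return o3.TensorProduct(irreps, irreps, o3.Irreps(out), instructions)

irreps = o3.Irreps("16x0e + 16x1o + 16x2e")   # D = 16 channels, degrees l = 0, 1, 2
square = channelwise_square(irreps)
linear = o3.Linear(square.irreps_out, irreps) # learnable mixing back to the working width
x = irreps.randn(8, -1)                       # batch of 8 residue features
print(linear(square(x, x)).shape)             # torch.Size([8, 144])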

A.4 Non-Linearities

To incorporate non-linearities in our geometric representations $\mathbf{V}^{l=0:l_{\textrm{max}}}$, we employ a roto-translation equivariant gate mechanism similar to the one described in Equiformer [Liao & Smidt (2023)]. This mechanism is present at the last step of Self-Interaction and in the message preparation step of Spatial Convolution (Figure 2). In both cases, we implement the activation by first isolating the scalar representations $\mathbf{V}^{l=0}$ and transforming them through a standard multilayer perceptron (MLP). We use the SiLU activation function [Elfwing et al. (2017)] after each layer of the MLP. From the output vector, one scalar gate is produced for each channel of $\mathbf{V}^{l=0:l_{\textrm{max}}}$ and multiplied into it.
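A minimal sketch of this gate, assuming e3nn irreps layouts and PyTorch (class and parameter names are ours): the l=0 part drives an MLP with SiLU activations, and each output scalar multiplicatively gates one channel of the full representation, which preserves equivariance because the gates are rotation invariant.

import torch
from e3nn import o3

class GateActivation(torch.nn.Module):
    def __init__(self, irreps="16x0e + 16x1o + 16x2e", hidden=64):
        super().__init__()
        self.irreps = o3.Irreps(irreps)
        n_scalars = sum(mul for mul, ir in self.irreps if ir.l == 0)
        n_channels = self.irreps.num_irreps              # one gate per channel of every degree
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(n_scalars, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, n_channels), torch.nn.SiLU(),
        )

    def forward(self, x):
        # Gather the l=0 slice(s) that drive the gates.
        scalars = torch.cat([x[..., sl] for sl, (mul, ir) in
                             zip(self.irreps.slices(), self.irreps) if ir.l == 0], dim=-1)
        gates = self.mlp(scalars)
        out, g0 = [], 0
        for sl, (mul, ir) in zip(self.irreps.slices(), self.irreps):
            g = gates[..., g0:g0 + mul].unsqueeze(-1)     # one scalar per channel
            block = x[..., sl].reshape(*x.shape[:-1], mul, ir.dim)
            out.append((g * block).reshape(*x.shape[:-1], mul * ir.dim))
            g0 += mul
        return torch.cat(out, dim=-1)

x = o3.Irreps("16x0e + 16x1o + 16x2e").randn(8, -1)
print(GateActivation()(x).shape)                          # torch.Size([8, 144])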

Figure 8: Convolutions of Ophiuchus. (a) In the Spatial Convolution, we update the feature representation $\mathbf{V}_{i}^{0:l_{\textrm{max}}}$ and position $\mathbf{P}_{i}$ of a node $i$ by first aggregating messages from its $k$ nearest neighbors. The message is composed out of the neighbor features $\mathbf{\tilde{V}}_{1:k}^{0:l_{\textrm{max}}}$ and the relative positions between the nodes $\mathbf{P}_{i}-\tilde{\mathbf{P}}_{1:k}$. After updating the features $\mathbf{V}_{i}^{0:l_{\textrm{max}}}$, we project the vectors $\mathbf{V}_{i}^{l=1}$ to predict an update to the position $\mathbf{P}_{i}$. (b) In the Sequence Convolution, we concatenate the feature representations of sequence neighbors $\mathbf{\tilde{V}}_{1:K}^{0:l_{\textrm{max}}}$ along with spherical harmonic signals of their relative positions $\mathbf{P}_{i}-\mathbf{P}_{j}$, for $i\in[1,K]$, $j\in[1,K]$. The concatenated feature vector is then linearly projected to the output dimensionality. To produce a coarsened position, each $\mathbf{\tilde{V}}_{1:K}^{0:l_{\textrm{max}}}$ produces a score that is used as a weight in summing the original positions $\mathbf{P}_{1:K}$. The Softmax function ensures the sum of the weights is normalized.

A.5 Roto-Translation Equivariance of Sequence Convolution

A Sequence Convolution kernel takes in $K$ coordinates $\mathbf{P}_{1:K}$ to produce a single coordinate $\mathbf{P}=\sum_{i=1}^{K}w_{i}\mathbf{P}_{i}$. We show that these weights need to be normalized in order for translation equivariance to be satisfied. Let $T$ denote a 3D translation vector; then translation equivariance requires:

(\mathbf{P}+T)=\sum_{i=1}^{K}w_{i}(\mathbf{P}_{i}+T)=\sum_{i=1}^{K}w_{i}\mathbf{P}_{i}+\sum_{i=1}^{K}w_{i}T\;\;\rightarrow\;\;\sum_{i=1}^{K}w_{i}=1

Rotation equivariance is immediately satisfied since the weighted sum of 3D vectors is a rotation equivariant operation. Let $R$ denote a rotation matrix. Then,

(R\mathbf{P})=R\sum_{i=1}^{K}w_{i}\mathbf{P}_{i}=\sum_{i=1}^{K}w_{i}(R\mathbf{P}_{i})
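As a quick numerical illustration (a sketch in PyTorch, not the paper's code), the snippet below pools positions with softmax-normalized weights and checks both equivariance properties:

import torch

def coarsen_positions(positions, scores):
    # positions: (K, 3); scores: (K,). Softmax guarantees the weights sum to 1,
    # which is exactly the condition derived above for translation equivariance.
    w = torch.softmax(scores, dim=0)
    return (w[:, None] * positions).sum(dim=0)

P, s, T = torch.randn(3, 3), torch.randn(3), torch.randn(3)
R = torch.linalg.qr(torch.randn(3, 3)).Q              # a random orthogonal matrix

assert torch.allclose(coarsen_positions(P + T, s), coarsen_positions(P, s) + T, atol=1e-6)
assert torch.allclose(coarsen_positions(P @ R.T, s), coarsen_positions(P, s) @ R.T, atol=1e-5)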

A.6 Transpose Sequence Convolution

Given a single coarse anchor position and feature representation $(\mathbf{P},\mathbf{V}^{l=0:l_{\textrm{max}}})$, we first map $\mathbf{V}^{l=0:l_{\textrm{max}}}$ into a new representation, reshaping it by chunking it into $K$ features. We then project $\mathbf{V}^{l=1}$ to produce $K$ relative position vectors $\Delta\mathbf{P}_{1:K}$, which are summed with the original position $\mathbf{P}$ to produce $K$ new coordinates.

Input: Kernel Size $K$
Input: Latent Representation $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
$\Delta\mathbf{P}_{1:K}\leftarrow\textrm{Linear}\left(\mathbf{V}^{l=1}\right)$
  \triangleright Predict $K$ relative positions
$\mathbf{P}_{1:K}\leftarrow\mathbf{P}+\Delta\mathbf{P}_{1:K}$
$\mathbf{V}^{l=0:l_{\textrm{max}}}_{1:K}\leftarrow\textrm{Reshape}_{K}\left(\textrm{Linear}(\mathbf{V}^{l=0:l_{\textrm{max}}})\right)$
  \triangleright Split the channels into $K$ chunks
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})_{1:K}$
Algorithm 6 Transpose Sequence Convolution

This procedure generates windows of $K$ representations and positions. These windows may intersect in the decoded output. We resolve those superpositions by taking the average position and average representations within intersections.
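A minimal sketch of this overlap averaging in PyTorch (index conventions and names are ours): decoded windows of length K with stride S are scattered back onto the residue axis, and each position is divided by the number of windows covering it.

import torch

def merge_overlapping_windows(windows, stride):
    # windows: (num_windows, K, C) decoded features (or positions with C = 3).
    num_windows, K, C = windows.shape
    length = (num_windows - 1) * stride + K
    out = torch.zeros(length, C)
    counts = torch.zeros(length, 1)
    for w in range(num_windows):
        out[w * stride: w * stride + K] += windows[w]
        counts[w * stride: w * stride + K] += 1
    return out / counts                               # average where windows intersect

decoded = torch.randn(4, 3, 8)                        # 4 windows, K = 3, 8 feature dims
print(merge_overlapping_windows(decoded, stride=2).shape)   # torch.Size([9, 8])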

A.7 Layer Normalization and Residual Connections

Training deep models can be challenging due to vanishing or exploding gradients. We employ layer normalization [Ba et al. (2016)] and residual connections [He et al. (2015)] to tackle those challenges, incorporating both at the end of every Self-Interaction and every convolution. To preserve roto-translation equivariance, we use the layer normalization described in Equiformer [Liao & Smidt (2023)], which rescales the $l$ signals independently within a representation using the root mean square value of the vectors. We found both residuals and layer norms to be critical in training deep models for large proteins.

A.8 Encoder-Decoder

Below we describe the encoder and decoder algorithms using the building blocks previously introduced.

Input: $\mathbf{C}_{\alpha}$ Position $\mathbf{P}^{\alpha}\in\mathbb{R}^{1\times 3}$
Input: All-Atom Relative Positions $\mathbf{P}^{\ast}\in\mathbb{R}^{n\times 3}$
Input: Residue Label $\mathbf{R}\in\mathcal{R}$
Output: Latent Representation $(\mathbf{P},\mathbf{V}^{l=0:l_{\textrm{max}}})$
$(\mathbf{P},\mathbf{V}^{0:2})\leftarrow\textrm{All-Atom Encoding}(\mathbf{P}^{\alpha},\mathbf{P}^{\ast},\mathbf{R})$
  \triangleright Initial Residue Encoding
$(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Linear}(\mathbf{P},\mathbf{V}^{0:2})$
for $i\leftarrow 1$ to $L$ do
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Self-Interaction}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Spatial Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Sequence Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
end for
return $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Algorithm 7 Encoder
Input: Latent Representation $(\mathbf{P},\mathbf{V}^{0:l_{\max}})$
Output: Protein Structure and Sequence Logits $(\hat{\mathbf{P}}^{\alpha},\hat{\mathbf{P}}^{\ast},\bm{\ell})$
for $i\leftarrow 1$ to $L$ do
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Self-Interaction}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Spatial Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
       $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})\leftarrow\textrm{Transpose Sequence Convolution}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
end for
$(\mathbf{P},\mathbf{V}^{0:2})\leftarrow\textrm{Linear}(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})$
$(\hat{\mathbf{P}}^{\alpha},\hat{\mathbf{P}}^{\ast},\bm{\ell})\leftarrow\textrm{All-Atom Decoding}(\mathbf{P},\mathbf{V}^{0:2})$
  \triangleright Final Protein Decoding
return $(\hat{\mathbf{P}}^{\alpha},\hat{\mathbf{P}}^{\ast},\bm{\ell})$
Algorithm 8 Decoder

A.9 Variable Sequence Length

We handle proteins of different sequence length by setting a maximum size for the model input. During training, proteins that are larger than the maximum size are cropped, and those that are smaller are padded. The boundary residue is given a special token that labels it as the end of the protein. For inference, we crop the tail of the output after the sequence position where this special token is predicted.
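A minimal sketch of this cropping and padding in PyTorch (the maximum length and token ids below are illustrative choices, not values from the paper):

import torch

MAX_LEN = 128          # illustrative maximum model input length
PAD, END = 21, 22      # illustrative token ids beyond the 20 residue labels

def crop_or_pad(residue_labels):
    # residue_labels: (L,) integer residue tokens for one protein.
    L = residue_labels.shape[0]
    if L >= MAX_LEN:
        return residue_labels[:MAX_LEN]                      # crop long proteins
    out = torch.full((MAX_LEN,), PAD, dtype=torch.long)
    out[:L] = residue_labels
    out[L] = END                                             # mark the boundary residue
    return out

def crop_at_end_token(predicted_labels):
    # At inference, drop everything after the first predicted END token.
    hits = (predicted_labels == END).nonzero()
    return predicted_labels if hits.numel() == 0 else predicted_labels[:hits[0, 0]]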

A.10 Time Complexity

We analyse the time complexity of a forward pass through the Ophiuchus autoencoder. Let the list $(\mathbf{P},\mathbf{V}^{0:l_{\textrm{max}}})_{N}$ be an arbitrary latent state with $N$ positions and $N$ geometric representations of dimensionality $D$, where for each $d\in[0,D]$ there are geometric features of degree $l\in[0,l_{\textrm{max}}]$. Note that the size of a geometric representation grows as $O(D\cdot l^{2}_{\textrm{max}})$ in memory.

  • The cost of an SO(3)-equivariant linear layer (Fig. 7.b) is $O(D^{2}\cdot l_{\textrm{max}}^{3})$.

  • The cost of a channel-wise tensor square operation (Fig. 7.c) is $O(D\cdot l^{4}_{\textrm{max}})$.

  • In Self-Interaction (Alg. 3), we use a tensor square and project it using a linear layer. The time complexity is given by $O\left(N\cdot(D^{2}\cdot l_{\textrm{max}}^{3}+D\cdot l_{\textrm{max}}^{4})\right)$ for the whole protein.

  • In Spatial Convolution (Alg. 5), a node aggregates geometric messages from $k$ of its neighbors. Resolving the $k$ nearest neighbors can be done efficiently in $O\left((N+k)\log N\right)$ using a k-d tree [Bentley (1975)]. For each residue, its $k$ neighbors prepare messages through linear layers, at a total cost of $O(N\cdot k\cdot D^{2}\cdot l^{3}_{\textrm{max}})$.

  • In a Sequence Convolution (Alg. 4), a kernel stacks $K$ geometric representations of dimensionality $D$ and linearly maps them to a new feature of dimensionality $\rho\cdot D$, where $\rho$ is a rescaling factor, yielding $O((K\cdot D)\cdot(\rho\cdot D)\cdot l^{3}_{\textrm{max}})=O(K\cdot\rho\cdot D^{2}\cdot l^{3}_{\textrm{max}})$. With length $N$ and stride $S$, the total cost is $O\left(\frac{N}{S}\cdot K\cdot\rho\cdot D^{2}\cdot l^{3}_{\textrm{max}}\right)$.

  • The cost of an Ophiuchus Block is the sum of the terms above,

    $O\left(D\cdot l_{\textrm{max}}^{3}\cdot N\cdot(\frac{K}{S}\cdot\rho D+kD)\right)$
  • An Autoencoder (Alg. 7, 8) that uses stride-$S$ convolutions for coarsening uses $L=O\left(\log_{S}(N)\right)$ layers to reduce a protein of size $N$ into a single representation. At depth $i$, the dimensionality is given by $D_{i}=\rho^{i}D$ and the sequence length is given by $N_{i}=\frac{N}{S^{i}}$. The time complexity of our Autoencoder is given by the geometric sum:

    $O\left(\sum_{i}^{\log_{S}(N)}\left(l_{\textrm{max}}^{3}\rho^{i}D\frac{N}{S^{i}}(K\rho^{i+1}D+k\rho^{i}D)\right)\right)=O\left(Nl_{\textrm{max}}^{3}(K\rho+k)D^{2}\sum_{i}^{\log_{S}(N)}\left(\frac{\rho^{2}}{S}\right)^{i}\right)$

    We are interested in the dependence on the length $N$ of a protein; therefore, we keep only the relevant parameters. Summing the geometric series and using the identity $x^{\log_{b}(a)}=a^{\log_{b}(x)}$, we get:

    $=O\left(N\frac{(\frac{\rho^{2}}{S})^{\log_{S}(N)+1}-1}{\frac{\rho^{2}}{S}-1}\right)=\begin{cases}O\left(N^{1+\log_{S}\rho^{2}}\right)&\text{for }\rho^{2}/S>1\\ O\left(N\log_{S}N\right)&\text{for }\rho^{2}/S=1\\ O\left(N\right)&\text{for }\rho^{2}/S<1\end{cases}$

    In most of our experiments we operate in the $\rho^{2}/S<1$ regime.
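For a concrete feel of these regimes, the short script below (a sketch with arbitrary constants, not a profiler) sums the per-depth cost over layers and compares how the total grows with N for ρ²/S below, at, and above one:

# Illustrative check of the three asymptotic regimes (constants are arbitrary).
import math

def total_cost(N, D=16, K=3, k=8, S=2, rho=1.0):
    depths = max(1, math.ceil(math.log(N, S)))
    cost = 0.0
    for i in range(depths):
        N_i, D_i = N / S**i, D * rho**i
        cost += N_i * D_i * (K * rho * D_i + k * D_i)   # per-depth term from the sum above
    return cost

for rho in (1.0, math.sqrt(2), 1.8):                    # rho^2/S < 1, = 1, > 1 with S = 2
    r = total_cost(4096, rho=rho) / total_cost(1024, rho=rho)
    print(f"rho^2/S = {rho**2 / 2:.2f}: cost(4N)/cost(N) = {r:.1f}")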

Appendix B Loss Details

B.1 Details on Vector Map Loss

The Huber Loss [Huber (1992)] behaves linearly for large inputs, and quadratically for small ones. It is defined as:

\text{HuberLoss}(y,f(x))=\begin{cases}\frac{1}{2}(y-f(x))^{2}&\text{if }|y-f(x)|\leq\delta,\\ \delta\cdot|y-f(x)|-\frac{1}{2}\delta^{2}&\text{otherwise.}\end{cases}

We found it to significantly improve training stability for large models compared to the mean squared error. We use $\delta=0.5$ for all our experiments.

The vector map loss measures differences of internal vector maps $V(\mathbf{P})-V(\hat{\mathbf{P}})$ between predicted and ground-truth positions, where $V(\mathbf{P})^{i,j}=(\mathbf{P}^{i}-\mathbf{P}^{j})\in\mathbb{R}^{3}$. Our decoding algorithm produces arbitrary symmetry breaks (Appendix A.2) for the positions $\mathbf{P}_{v}^{*}$ and $\mathbf{P}_{u}^{*}$ of atoms that are not orderable. Because of that, a loss on the vector map is not directly applicable to the output of our model, since the order of the model output might differ from the order of the ground-truth data. To solve this, we consider both possible orderings of permutable atoms and choose the one that minimizes the loss. Solving for the optimal ordering jointly over the whole system is not feasible, since the number of permutations to be considered scales exponentially with $N$. Instead, we first compute a vector map loss internal to each residue, consider the alternative order of permutable atoms, and choose the candidate that minimizes this local loss. This ordering is then used for the rest of our losses.
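A simplified sketch of this loss in PyTorch (it assumes, for illustration, a single permutable pair per residue and uses torch.nn.HuberLoss with δ=0.5 as above):

import torch

huber = torch.nn.HuberLoss(delta=0.5)

def vector_map(P):
    # P: (n, 3) -> (n, n, 3) map of pairwise difference vectors.
    return P[:, None, :] - P[None, :, :]

def residue_vector_map_loss(pred, target, swap_idx=None):
    # pred, target: (n_atoms, 3) positions of one residue (relative to the C-alpha).
    # swap_idx: indices (i, j) of one permutable pair, or None.
    loss = huber(vector_map(pred), vector_map(target))
    if swap_idx is None:
        return loss, pred
    i, j = swap_idx
    swapped = pred.clone()
    swapped[[i, j]] = pred[[j, i]]                    # alternative ordering of the pair
    loss_swapped = huber(vector_map(swapped), vector_map(target))
    # Keep whichever ordering gives the smaller intra-residue loss;
    # the chosen ordering is reused by the remaining (chemical) losses.
    return (loss, pred) if loss <= loss_swapped else (loss_swapped, swapped)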

B.2 Chemical Losses

We consider bonds, interbond angles, dihedral angles and steric clashes when computing a loss for chemical validity. Let $\mathbf{P}\in\mathbb{R}^{N_{\textrm{Atom}}\times 3}$ be the list of atom positions in the ground-truth data, and let $\hat{\mathbf{P}}\in\mathbb{R}^{N_{\textrm{Atom}}\times 3}$ be the list of atom positions predicted by the model. For each chemical interaction, we precompute the indices of the atoms that participate in it. For example, for bonds we precompute a list of pairs of atoms that are bonded according to the chemical profile of each residue. Our chemical losses then take the form:

\mathcal{L}_{\textrm{Bonds}}=\frac{1}{|\mathcal{B}|}\sum_{(v,u)\in\mathcal{B}}\Big{|}\Big{|}\;\;||\hat{\mathbf{P}}_{v}-\hat{\mathbf{P}}_{u}||_{2}-||\mathbf{P}_{v}-\mathbf{P}_{u}||_{2}\;\;\Big{|}\Big{|}^{2}_{2}

where $\mathcal{B}$ is a list of pairs of indices of atoms that form a bond. We compare the distance between bonded atoms in the prediction and in the ground-truth data.

\mathcal{L}_{\textrm{Angles}}=\frac{1}{|\mathcal{A}|}\sum_{(v,u,p)\in\mathcal{A}}||\alpha(\hat{\mathbf{P}}_{v},\hat{\mathbf{P}}_{u},\hat{\mathbf{P}}_{p})-\alpha\left(\mathbf{P}_{v},\mathbf{P}_{u},\mathbf{P}_{p}\right)||^{2}_{2}

where $\mathcal{A}$ is a list of 3-tuples of indices of atoms that are connected through bonds. The tuple takes the form $(v,u,p)$ where $u$ is connected to $v$ and to $p$. Here, the function $\alpha(\cdot,\cdot,\cdot)$ measures the angle in radians between positions of atoms that are connected through bonds.

\mathcal{L}_{\textrm{Dihedrals}}=\frac{1}{|\mathcal{D}|}\sum_{(v,u,p,q)\in\mathcal{D}}||\tau(\hat{\mathbf{P}}_{v},\hat{\mathbf{P}}_{u},\hat{\mathbf{P}}_{p},\hat{\mathbf{P}}_{q})-\tau(\mathbf{P}_{v},\mathbf{P}_{u},\mathbf{P}_{p},\mathbf{P}_{q})||^{2}_{2}

where $\mathcal{D}$ is a list of 4-tuples of indices of atoms that are connected by bonds, that is, $(v,u,p,q)$ where $(v,u)$, $(u,p)$ and $(p,q)$ are connected by bonds. Here, the function $\tau(\cdot,\cdot,\cdot,\cdot)$ measures the dihedral angle in radians.

\mathcal{L}_{\textrm{Clashes}}=\frac{1}{|\mathcal{C}|}\sum_{(v,u)\in\mathcal{C}}H(r_{v}+r_{u}-||\hat{\mathbf{P}}_{v}-\hat{\mathbf{P}}_{u}||_{2})

where $H$ is a smooth, differentiable Heaviside-like step function, $\mathcal{C}$ is a list of pairs of indices of atoms that are not bonded, and $(r_{v},r_{u})$ are the van der Waals radii of atoms $v$ and $u$.
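A minimal sketch of the bond and clash terms in PyTorch (the index lists, radii and the sigmoid used as the smooth step function are illustrative placeholders):

import torch

def bond_loss(P_hat, P, bond_idx):
    # bond_idx: (|B|, 2) long tensor of bonded atom pairs.
    v, u = bond_idx[:, 0], bond_idx[:, 1]
    d_hat = (P_hat[v] - P_hat[u]).norm(dim=-1)
    d = (P[v] - P[u]).norm(dim=-1)
    return ((d_hat - d) ** 2).mean()

def clash_loss(P_hat, nonbonded_idx, radii, sharpness=10.0):
    # Smooth Heaviside-like step: a sigmoid of the van der Waals overlap.
    v, u = nonbonded_idx[:, 0], nonbonded_idx[:, 1]
    d_hat = (P_hat[v] - P_hat[u]).norm(dim=-1)
    overlap = radii[v] + radii[u] - d_hat
    return torch.sigmoid(sharpness * overlap).mean()

The angle and dihedral terms follow the same pattern, with an angle or dihedral function in place of the distance.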

B.3 Regularization

When training autoencoder models for latent diffusion, we regularize the learned latent space so that representations remain within the range of the diffusion source distribution $\mathcal{N}(0,1)$. Let $\mathbf{V}^{l}_{i}$ denote the $i$-th channel of a vector representation $\mathbf{V}^{l}$. We regularize the autoencoder latent space by optimizing the radial and angular components of our vectors:

\mathcal{L}_{\textrm{reg}}=\sum_{i}\textrm{ReLU}(1-||\mathbf{V}_{i}^{l}||_{1})+\sum_{i}\sum_{j\neq i}(\mathbf{V}_{i}^{l}\cdot\mathbf{V}_{j}^{l})

The first term penalizes vector magnitudes larger than one, and the second term induces vectors to spread angularly. We find these regularizations to significantly help training of the denoising diffusion model.

B.4 Total Loss

We weight the different losses in our pipeline. For the standard training of the autoencoder, we use weights:

\mathcal{L}=10\cdot\mathcal{L}_{\textrm{VectorMap}}+\mathcal{L}_{\textrm{CrossEntropy}}+0.1\cdot\mathcal{L}_{\textrm{Bonds}}+0.1\cdot\mathcal{L}_{\textrm{Angles}}+0.1\cdot\mathcal{L}_{\textrm{Dihedrals}}+10\cdot\mathcal{L}_{\textrm{Clashes}}

For fine-tuning the model, we increase the weight of chemical losses significantly:

\mathcal{L}=10\cdot\mathcal{L}_{\textrm{VectorMap}}+\mathcal{L}_{\textrm{CrossEntropy}}+1.0\cdot\mathcal{L}_{\textrm{Bonds}}+1.0\cdot\mathcal{L}_{\textrm{Angles}}+1.0\cdot\mathcal{L}_{\textrm{Dihedrals}}+100\cdot\mathcal{L}_{\textrm{Clashes}}

We find that placing high weights on the chemical losses early in training can hurt convergence, in particular for models that operate on long proteins.

Appendix C Details on Autoencoder Comparison

We implement the comparison EGNN model in its original form, following [Satorras et al. (2022)]. We use kernel size $K=3$ and stride $S=2$ for downsampling and upsampling, and follow the procedure described in [Fu et al. (2023)]. For this comparison, we train the models to minimize the residue label cross-entropy and the vector map loss.

Ophiuchus significantly outperforms the EGNN-based architecture. The standard EGNN is SO(3)-equivariant with respect to its positions; however, it models features with SO(3)-invariant representations. As part of its message passing, EGNN uses relative vectors between nodes to update positions. However, when downsampling positions, the total number of relative vectors available shrinks quadratically, making it increasingly challenging to recover coordinates. Our method instead uses SO(3)-equivariant feature representations and is able to keep 3D vector information in its features as it coarsens positions. Thus, with very few parameters, our model is able to encode and recover protein structures.

Appendix D More on Ablation

In addition to the all-atom ablation in Table 2, we conducted a similar ablation study on models trained on backbone atoms only, shown in Table 4. We found that backbone-only models performed slightly better on backbone reconstruction. Furthermore, in Fig. 9.a we compare the relative vector loss between ground-truth coordinates and the coordinates reconstructed by the autoencoder for different downsampling factors, averaged over different trials and channel rescale factors. As expected, we find that structure reconstruction accuracy is better for lower downsampling factors. In Fig. 9.b we similarly plot the residue recovery accuracy for different downsampling factors and again find the expected result: residue recovery is better for lower downsampling factors.

Notably, the relative change in structure recovery accuracy across downsampling factors is much smaller than the relative change in residue recovery accuracy. This suggests our model learns a compressed prior for structure much more efficiently than for sequence, consistent with the common observation that sequence is highly redundant in natural proteins. In Fig. 9.c we compare the structure reconstruction accuracy across different channel rescaling factors. Interestingly, we find that for larger rescaling factors the structure reconstruction accuracy is slightly lower.

However, since we train for only 10 epochs, models with larger rescaling factors have more parameters and likely require longer training to achieve similar results. Finally, in Fig. 9.d we compare the residue recovery accuracy across different rescaling factors. We see that higher rescaling factors yield a higher residue recovery rate. This suggests that sequence recovery depends strongly on the number of model parameters and is not easily captured by efficient structural models.

Figure 9: Ablation Training Curves. We plot metrics across 10 training epochs for our ablated models from Table 2. (a-b) compare models across downsampling factors and highlight the tradeoff between downsampling and reconstruction. (c-d) compare different rescaling factors for a fixed downsampling factor of 160.
Table 4: Recovery rates from bottleneck representations - Backbone only
Factor Channels/Layer # Params [1e6] ↓ Cα-RMSD (Å) ↓ GDT-TS ↑ GDT-HA ↑
17 [16, 24, 36] 0.34 0.81 ± 0.31 96 ± 3 81 ± 5
17 [16, 27, 45] 0.38 0.99 ± 0.45 95 ± 3 81 ± 6
17 [16, 32, 64] 0.49 1.03 ± 0.42 92 ± 4 74 ± 6
53 [16, 24, 36, 54] 0.49 0.99 ± 0.38 92 ± 5 74 ± 8
53 [16, 27, 45, 76] 0.67 1.08 ± 0.40 91 ± 6 71 ± 8
53 [16, 32, 64, 128] 1.26 1.02 ± 0.64 92 ± 9 75 ± 11
160 [16, 24, 36, 54, 81] 0.77 1.33 ± 0.42 84 ± 7 63 ± 8
160 [16, 27, 45, 76, 129] 1.34 1.11 ± 0.29 89 ± 4 69 ± 7
160 [16, 32, 64, 128, 256] 3.77 0.90 ± 0.44 94 ± 7 77 ± 9

Appendix E Latent Space Analysis

E.1 Visualization of Latent Space

To visualize the learned latent space of Ophiuchus, we pass 50k samples from the training set through a large 485-length model (Figure 10). We collect the scalar components of the bottleneck representations and use t-SNE to produce 2D coordinates for plotting; we similarly produce a 3D embedding and use it for coloring. The result is visualized in Figure 10. Visual inspection of neighboring points reveals unsupervised clustering of similar folds and sequences.
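A minimal sketch of this visualization with scikit-learn (array names are ours; bottleneck_scalars stands for the collected l=0 bottleneck features):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

bottleneck_scalars = np.random.randn(5000, 128)   # placeholder for the collected l=0 features

xy = TSNE(n_components=2).fit_transform(bottleneck_scalars)    # 2D plot coordinates
rgb = TSNE(n_components=3).fit_transform(bottleneck_scalars)   # 3D embedding used for coloring
rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0))           # rescale to [0, 1] for RGB

plt.scatter(xy[:, 0], xy[:, 1], c=rgb, s=2)
plt.show()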

Figure 10: Latent Space Analysis: we qualitatively examine the unsupervised clustering capabilities of our representations through t-SNE.
Figure 11: Latent Conformational Interpolation. Top: Comparison of chemical validity metrics across interpolated structures of the Nuclear Factor of Activated T-Cells (NFAT) protein (PDB IDs: 1S9K and 2O93). Bottom: Comparison of chemical validity metrics across interpolated structures of the Maltose-binding periplasmic protein (PDB IDs: 4JKM and 6LF3). For both proteins, we plot results for interpolations in the original data space, in the latent space, and in the autoencoder-reconstructed data space.

Appendix F Latent Diffusion Details

We train all Ophiuchus diffusion models with learning rate $\textrm{lr}=1\times 10^{-3}$ for 10,000 epochs. For denoising networks we use Self-Interactions with the chunked-channel tensor square operation (Alg. 3).

Our tested models are trained on two different datasets. The MiniProtein scaffold dataset, introduced by [Cao et al. (2022)], consists of 66k all-atom structures with sequence lengths between 50 and 65 and comprises diverse folds and sequences across 5 secondary structure topologies. We also train a model on the data curated by [Yim et al. (2023)], which consists of approximately 22k proteins, to compare Ophiuchus to FrameDiff and RFDiffusion on backbone generation for proteins of up to 485 residues.

F.1 Self-Consistency Scores

To compute the scTM scores, we recover 8 sequences using ProteinMPNN for each of 500 backbones sampled from the tested diffusion models, using a sampling temperature of 0.1 for ProteinMPNN. Unlike the original work, where the sequences were folded using AlphaFold2, we use OmegaFold [Wu et al. (2022b)], similar to [Lin & AlQuraishi (2023)]. The predicted structures are aligned to the original sampled backbones, and the TM-Score and RMSD are calculated for each alignment. To calculate the diversity measurement, we hierarchically cluster samples using MaxCluster. Diversity is defined as the number of clusters divided by the total number of samples, as described in [Yim et al. (2023)].

F.2 MiniProtein Model

In Figure 13 we show generated samples from our MiniProtein model and compare the marginal distributions of our predictions and the ground-truth data. In Figure 14.b we show the distribution of TM-scores for joint sampling of sequence and all-atom structure by the diffusion model. We produce marginal distributions of generated samples (Fig. 14.e) and find them to successfully approximate the densities of the ground-truth data. To test the robustness of joint sampling of structure and sequence, we compute self-consistency TM-Scores [Trippe et al. (2023)]: 54% of our sampled backbones have scTM scores > 0.5, compared to 77.6% of samples from RFDiffusion. We also sample 100 proteins of 50-65 amino acids in 0.21s, compared to 11s for RFDiffusion on an RTX A6000.

F.3 Backbone Model

We include metrics on designability in Figure 12.

Figure 12: Designability of Sampled Backbones. (a) To analyze the designability of our sampled backbones, we show a plot of scTM vs. plDDT. (b) To analyze the composition of secondary structures in the samples, we show a plot of helix percentage vs. beta-sheet percentage per sample.
Figure 13: All-Atom Latent Diffusion. (a) Random samples from an all-atom MiniProtein model. (b) Random samples from the MiniProtein backbone model. (d) Ramachandran plots of sampled (left) and ground (right) distributions. (e) Comparison of Cα-Cα distances and all-atom bond lengths between learned (red) and ground (blue) distributions.
Figure 14: Self-Consistency Template Matching Scores for Ophiuchus Diffusion. (a) Distribution of scTM scores for 500 sampled backbones. (b) Distribution of TM-Scores of jointly sampled backbones and sequences from an all-atom diffusion model and the corresponding OmegaFold models. (c) Distribution of average plDDT scores for 500 sampled backbones. (d) TM-Score between a sampled backbone (green) and its OmegaFold structure (blue). (e) TM-Score between a sampled backbone (green) and the most confident OmegaFold structure of the sequence recovered with ProteinMPNN (blue).

Appendix G Visualization of All-Atom Reconstructions

Figure 15: Reconstruction of 128-length all-atom proteins. Model used for visualization reconstructs all-atom protein structures from coarse representations of 120 scalars and 120 3D vectors.

Appendix H Visualization of Backbone Reconstructions

Figure 16: Reconstruction of 485-length protein backbones. Model used for visualization reconstructs large backbones from coarse representations of 160 scalars and 160 3D vectors.

Appendix I Visualization of Random Backbone Samples

Figure 17: Random Samples from an Ophiuchus 485-length Backbone Model. Model used for visualization samples SO(3) representations of 160 scalars and 160 3D vectors. We measure 0.46 seconds to sample a protein backbone.