Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

📅 2024-08-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the challenge of modeling protein–nucleic acid interactions, specifically predicting binding free energy (ΔG) and identifying key binding residues. It presents the largest open-source multi-omic sequence Transformer foundation model to date, pretrained with joint self-supervised learning over DNA/RNA and protein sequences and without structural labels, so that its joint representations are emergently consistent with the Central Dogma. When fine-tuned, the model achieves state-of-the-art performance on ΔG prediction and can indicate which protein residues are most involved in binding. In many cases it surpasses single-omic baselines in both performance per FLOP and absolute performance, suggesting a more general approach to building foundation models for biology.

📝 Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any \textit{a priori} structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omic distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.
Problem

Research questions and friction points this paper is trying to address.

Modeling protein-nucleic acid interactions using multi-omic transformers
Predicting Gibbs free energy changes in binding interactions
Learning joint representations across DNA/RNA and protein sequences
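
For readers unfamiliar with the quantity being predicted, the sketch below shows the standard thermodynamic relation between a dissociation constant and the binding free energy, ΔG = RT ln(Kd / 1 M). This is textbook thermodynamics, not a description of how the paper derives its labels or trains its model. The function names and the default temperature of 298.15 K are assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K); a standard physical constant.
R_KCAL_PER_MOL_K = 1.987204259e-3


def delta_g_from_kd(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Uses dG = R * T * ln(Kd / c0) with a standard concentration c0 = 1 M.
    The function name and default temperature are illustrative assumptions.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)


def kd_from_delta_g(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Inverse relation: dissociation constant (mol/L) from dG in kcal/mol."""
    return math.exp(delta_g_kcal / (R_KCAL_PER_MOL_K * temperature_k))


if __name__ == "__main__":
    # A 1 nM binder at room temperature.
    print(f"{delta_g_from_kd(1e-9):.2f} kcal/mol")
```

A more negative ΔG corresponds to tighter binding: under these assumptions a 1 nM dissociation constant maps to roughly -12.3 kcal/mol, and each tenfold change in Kd shifts ΔG by about 1.4 kcal/mol at room temperature.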
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-omic transformers model protein-nucleic acid interactions
Joint representations learned from unlabeled biosequences
State-of-the-art fine-tuning for binding energy prediction
Authors
Sully F. Chen
Duke University School of Medicine, Durham, NC 27710, USA
Robert J. Steele
NYU Langone Health, New York, NY 10016, USA
Beakal Lemeneh
NYU Langone Health, New York, NY 10016, USA
Shivanand P. Lad
Duke University School of Medicine, Department of Neurological Surgery, Durham, NC 27710, USA
Eric K. Oermann
NYU Langone Health, Department of Neurological Surgery, New York, NY 10016, USA
Glen M. Hocky