Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

📅 2024-08-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of modeling protein-nucleic acid interactions, specifically predicting binding free energy (ΔG) and identifying key binding residues. We present the largest open-source multi-omic sequence Transformer foundation model to date, trained by joint self-supervised learning over unlabeled DNA/RNA and protein sequences. Without any structural labels, the model learns joint representations that are emergently consistent with the Central Dogma. After fine-tuning, it achieves state-of-the-art performance on ΔG prediction and can indicate which protein residues are most involved in binding. In many cases it surpasses single-omic baselines in both performance per FLOP and absolute performance, which suggests a more general approach to building foundation models for biology.

📝 Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy ($\Delta G$) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omics distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.

Problem

Research questions and friction points this paper is trying to address.

Modeling protein-nucleic acid interactions using multi-omic transformers

Predicting Gibbs free energy changes in binding interactions

Learning joint representations across DNA/RNA and protein sequences
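
As background on the ΔG target named above: the standard binding free energy is related to the dissociation constant by ΔG° = RT ln(K_d / c°), with c° = 1 M. The sketch below is not taken from the paper. It only illustrates what the regression target means physically, and the function name `delta_g_from_kd` and the default temperature of 298.15 K are assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def delta_g_from_kd(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy (kcal/mol) from a dissociation constant (mol/L).

    Uses dG = R * T * ln(Kd / 1 M). Hypothetical helper, not the paper's code.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


if __name__ == "__main__":
    # Tighter binders (smaller Kd) give a more negative dG.
    for kd in (1e-6, 1e-9):
        print(f"Kd = {kd:.0e} M -> dG = {delta_g_from_kd(kd):.2f} kcal/mol")
```

A smaller K_d (tighter binding) gives a more negative ΔG, so a model that predicts ΔG is in effect ranking binding affinity on a logarithmic scale.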

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-omic transformers model protein-nucleic acid interactions

Joint representations learned from unlabeled biosequences

State-of-the-art fine-tuning for binding energy prediction

🔎 Similar Papers

Advancing bioinformatics with large language models: components, applications and perspectives