🤖 AI Summary
To address the challenge of integrating heterogeneous protein modalities, this paper introduces OneProt, a multi-modal foundation model unifying protein sequences, 3D structures, textual descriptions, and binding sites. Methodologically, it adopts the ImageBind framework: each modality encoder, drawn from a mix of graph neural networks (GNNs) and Transformers, is aligned pairwise with the sequence encoder in a lightweight fine-tuning scheme, so fully matched tuples across all modalities are not required. Key contributions include: (1) a pairwise-alignment paradigm for protein multi-modal representation learning; (2) modality ablations showing that the binding-site encoder, not used in comparable models before, contributes substantially to predictive performance; and (3) cross-modal transfer of representational information to the sequence encoder, under which evolutionarily related proteins align in similar directions in latent space. OneProt demonstrates strong performance on retrieval tasks and a broad spectrum of downstream tasks, including enzyme function prediction and binding-site analysis, pointing toward applications in drug discovery, biocatalytic reaction planning, and protein engineering.
📝 Abstract
Recent advances in Artificial Intelligence have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, text, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of protein modality encoders in a lightweight fine-tuning scheme that relies on pairwise alignment with sequence data rather than requiring fully matched tuples across all modalities. The model combines Graph Neural Network and Transformer architectures. It demonstrates strong performance on retrieval tasks and showcases the efficacy of multi-modal systems in Protein Machine Learning across a broad spectrum of downstream tasks, including enzyme function prediction and binding site analysis. Furthermore, OneProt enables the transfer of representational information from specialized encoders to the sequence encoder, enhancing its ability to distinguish evolutionarily related from unrelated sequences and exhibiting representational properties in which evolutionarily related proteins align in similar directions within the latent space. In addition, we extensively investigate modality ablations to identify the encoders that contribute most to predictive performance, highlighting the significance of the binding site encoder, which has not been used in similar models previously. This work expands the horizons of multi-modal protein models, paving the way for transformative applications in drug discovery, biocatalytic reaction planning, and protein engineering.
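The pairwise alignment scheme described above can be illustrated with a minimal sketch. This assumes a CLIP/ImageBind-style symmetric InfoNCE objective between the sequence encoder's embeddings and one other modality's embeddings; the function name, temperature value, and loss details are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_alignment_loss(seq_emb: torch.Tensor,
                            mod_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning one modality to the sequence space.

    seq_emb, mod_emb: (batch, dim) embeddings for the same batch of
    proteins, from the sequence encoder and from one other modality
    encoder (structure, text, or binding site). Hypothetical sketch.
    """
    # L2-normalize so dot products are cosine similarities
    seq = F.normalize(seq_emb, dim=-1)
    mod = F.normalize(mod_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds matched pairs
    logits = seq @ mod.t() / temperature
    targets = torch.arange(seq.size(0), device=seq.device)
    # symmetric cross-entropy: sequence->modality and modality->sequence
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Because every modality is aligned against sequence data in separate pairs, a training example only needs a (sequence, X) pair rather than all modalities at once, which is what makes the scheme lightweight.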