VenusX: Unlocking Fine-Grained Functional Understanding of Proteins

📅 2025-05-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing protein function prediction research operates predominantly at the whole-protein level and lacks both mechanistic insight and systematic evaluation for fine-grained functional units such as residues, segments, and domains. Method: We introduce VenusX, the first large-scale, multi-level protein functional benchmark, covering six functional unit types: active sites, binding sites, conserved regions, motifs, domains, and epitopes. It supports functional annotation at three levels (residue, fragment, and domain) and function-based protein pairing. VenusX establishes an evaluation protocol integrating cross-level, multi-task learning with mixed-family and cross-family data splits to assess both in-distribution and out-of-distribution generalization. Contribution/Results: Built from more than 878,000 samples drawn from InterPro, BioLiP, and SAbDab, VenusX provides unified benchmarking for protein language models, sequence-structure hybrid models, structure-based methods, and alignment tools. All code and data are publicly released to support fine-grained analysis of functional mechanisms and assessment of the biological knowledge captured by models.

📝 Abstract
Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at https://github.com/ai4protein/VenusX.
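The difference between the mixed-family and cross-family splits mentioned in the abstract can be illustrated with a minimal sketch. This is not the VenusX splitting code: the sample layout (dictionaries with hypothetical `id` and `family` keys), the function names, and the 20% test fraction are assumptions made for illustration, and the sequence-identity filtering step is left out because it requires an external clustering tool.

```python
import random
from collections import defaultdict


def mixed_family_split(samples, test_frac=0.2, seed=0):
    """Random split: the same family may appear in both train and test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def cross_family_split(samples, test_frac=0.2, seed=0):
    """Hold out whole families so that no test family is seen in training."""
    by_family = defaultdict(list)
    for sample in samples:
        by_family[sample["family"]].append(sample)

    families = sorted(by_family)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(families)
    n_test = max(1, int(len(families) * test_frac))
    test_families = set(families[:n_test])

    train = [s for s in samples if s["family"] not in test_families]
    test = [s for s in samples if s["family"] in test_families]
    return train, test


# Toy data: 50 hypothetical proteins spread evenly over 5 hypothetical families.
samples = [{"id": f"p{i}", "family": f"F{i % 5}"} for i in range(50)]
train, test = cross_family_split(samples)
```

In this sketch the mixed-family split approximates an in-distribution setting, while the cross-family split approximates an out-of-distribution one, since every test protein belongs to a family the model never saw during training.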
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained protein function annotation benchmarks
Need for residue-level functional understanding in proteins
Absence of standardized evaluation for diverse protein models
Innovation

Methods, ideas, or system contributions that make the work stand out.

VenusX benchmark for fine-grained protein annotation
Residue, fragment, domain-level functional analysis
Mixed-family and cross-family performance evaluation
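As an illustration of residue-level evaluation, the sketch below scores a per-residue binary prediction (functional site vs. background) with precision, recall, and F1. The source only says that multiple metrics are reported, so the choice of F1 here is an assumption, and the function name and toy labels are hypothetical.

```python
def residue_prf1(true_labels, pred_labels):
    """Precision, recall, and F1 for per-residue binary labels (1 = functional site)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: one label per residue of an 8-residue hypothetical sequence.
true_labels = [0, 0, 1, 1, 0, 1, 0, 0]
pred_labels = [0, 1, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = residue_prf1(true_labels, pred_labels)
```

Because functional residues are usually a small minority of a sequence, a metric of this kind is more informative than plain accuracy, which a model could inflate by predicting background everywhere.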

Yang Tan
Shanghai Jiao Tong University

Wenrui Gou
East China University of Science and Technology

Bozitao Zhong
Shanghai Jiao Tong University
Computational Biology, Protein Design, Deep Learning, Synthetic Biology

Liang Hong
Shanghai Jiao Tong University

Huiqun Yu
East China University of Science and Technology

Bingxin Zhou
Shanghai Jiao Tong University
Graph Neural Networks, Protein Representation Learning, AI4Biology