CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization

📅 2026-03-19
🤖 AI Summary
This work addresses a gap in existing subcellular localization benchmarks: they are not integrated with three-dimensional protein structures, which has hindered the development of structure-aware predictive models. To bridge this gap, we introduce CAPSUL, the first benchmark dataset that unifies human protein 3D structures, including AlphaFold2-predicted conformations, with expert-annotated, fine-grained subcellular localization labels. We evaluate models that use sequence information, structural information, or both, and we additionally study reweighted training, a single-label strategy, and attention visualization. The attention analysis surfaces biologically meaningful structure–localization relationships, such as α-helices associated with Golgi localization. Our results indicate that incorporating 3D structural data improves localization performance and yields interpretable models, pointing toward data-driven discovery in cell biology.

📝 Abstract
Predicting subcellular localization is a crucial biological task for drug target identification and function annotation. Although subcellular localization is known to be closely associated with protein structure, no existing dataset offers comprehensive 3D structural information together with detailed subcellular localization annotations, which severely hinders the application of promising structure-based models to this task. To address this gap, we introduce a new benchmark called $\mathbf{CAPSUL}$, a $\mathbf{C}$omprehensive hum$\mathbf{A}$n $\mathbf{P}$rotein benchmark for $\mathbf{SU}$bcellular $\mathbf{L}$ocalization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and structure-based models, showing the importance of structural features for this task. Furthermore, we explore reweighting and single-label classification strategies to facilitate future investigation of structure-based methods for this task. Lastly, we showcase the interpretability of structure-based methods through a case study on the Golgi apparatus, in which attention mechanisms reveal the $\alpha$-helix as a decisive localization pattern. This demonstrates the potential of such methods to offer intuitive biological interpretability and paves the way for data-driven discoveries in cell biology.
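The abstract mentions reweighting for this multi-label task but does not specify the scheme. As a minimal sketch of one common option, the code below assumes inverse-frequency positive weights (number of negatives divided by number of positives per compartment) applied to binary cross-entropy. The function names, the two-compartment toy labels, and the weighting rule itself are illustrative assumptions, not the paper's method.

```python
import math

def positive_class_weights(labels):
    """Per-compartment weight = (#negatives / #positives).

    labels: list of multi-hot rows, one row per protein and one
    column per compartment. Rare compartments get larger weights.
    This weighting rule is an assumption, not taken from the paper.
    """
    n = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        pos = sum(row[c] for row in labels)
        # Fall back to 1.0 when a compartment has no positive example.
        weights.append((n - pos) / pos if pos > 0 else 1.0)
    return weights

def weighted_bce(probs, targets, pos_weights, eps=1e-12):
    """Mean binary cross-entropy with positive terms scaled per compartment."""
    total = 0.0
    count = 0
    for p_row, t_row in zip(probs, targets):
        for p, t, w in zip(p_row, t_row, pos_weights):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# Toy data: 4 proteins, 2 hypothetical compartments (one common, one rare).
labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = positive_class_weights(labels)

# An uninformative predictor that outputs 0.5 everywhere.
probs = [[0.5, 0.5] for _ in labels]
loss = weighted_bce(probs, labels, weights)
```

With these weights, the positive and negative examples of each compartment contribute equally to the loss, so a rare compartment is not drowned out by its many negatives. The same idea is available in deep learning libraries as a per-class positive weight on the binary cross-entropy loss.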
Problem

Research questions and friction points this paper is trying to address.

subcellular localization
protein structure
3D structural information
human protein
benchmark dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

subcellular localization
protein structure
3D structural representation
structure-based model
biological interpretability
Yicheng Hu
University of Science and Technology of China
Xinyu Lin
National University of Singapore
recommendation
Shulin Li
Tsinghua University
Wenjie Wang
University of Science and Technology of China
Fengbin Zhu
National University of Singapore
NLP, IR, LLM, Document AI, AI + Finance
Fuli Feng
University of Science and Technology of China