🤖 AI Summary
This work addresses a gap in existing subcellular localization benchmarks: they do not integrate three-dimensional protein structures, which has hindered the development of structure-aware predictive models. To bridge this gap, the authors introduce CAPSUL, the first benchmark dataset that unifies high-quality human protein 3D structures, including AlphaFold2-predicted conformations, with expert-annotated, fine-grained subcellular localization labels. They evaluate sequence-based and structure-based models on the benchmark, explore reweighted training and a single-label strategy, and use attention visualization to uncover biologically meaningful structure–localization relationships, such as α-helices associated with Golgi localization. The results indicate that incorporating 3D structural information matters for localization prediction and yields interpretable models, pointing toward data-driven discovery in cell biology.
📝 Abstract
Predicting protein subcellular localization is a crucial biological task for drug target identification and function annotation. Although subcellular localization is known to be closely associated with protein structure, no existing dataset offers comprehensive 3D structural information together with detailed subcellular localization annotations, which severely hinders the application of promising structure-based models to this task. To address this gap, we introduce a new benchmark called $\mathbf{CAPSUL}$, a $\mathbf{C}$omprehensive hum$\mathbf{A}$n $\mathbf{P}$rotein benchmark for $\mathbf{SU}$bcellular $\mathbf{L}$ocalization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and structure-based models, demonstrating the importance of incorporating structural features in this task. Furthermore, we explore reweighting and single-label classification strategies to facilitate future investigation of structure-based methods for this task. Lastly, we showcase the interpretability of structure-based methods through a case study on the Golgi apparatus, in which attention mechanisms reveal a decisive localization pattern, the $\alpha$-helix, demonstrating the potential for intuitive biological interpretation and paving the way for data-driven discoveries in cell biology.
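
The abstract mentions reweighting but does not say which scheme is used. As a rough illustration only, the sketch below shows one common choice for imbalanced multi-label classification: scaling the positive term of the binary cross-entropy by a per-class negative-to-positive ratio. The function names, the toy label matrix, and the fallback weight of 1.0 are all assumptions made for this example and are not taken from CAPSUL.

```python
import math


def positive_class_weights(labels):
    """Per-class weight = (#negatives / #positives) for a 0/1 multi-label matrix.

    Assumption: this inverse-frequency ratio is one common reweighting choice,
    not necessarily the scheme used in the paper.
    """
    n_samples = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        positives = sum(row[c] for row in labels)
        negatives = n_samples - positives
        # Assumption: classes with no positive example fall back to weight 1.0.
        weights.append(negatives / positives if positives > 0 else 1.0)
    return weights


def weighted_bce(probs, targets, pos_weights, eps=1e-12):
    """Mean binary cross-entropy with the positive term scaled per class."""
    total = 0.0
    count = 0
    for p_row, t_row in zip(probs, targets):
        for p, t, w in zip(p_row, t_row, pos_weights):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy data: 4 proteins x 3 hypothetical compartments (one frequent, two rare).
toy_labels = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]
toy_weights = positive_class_weights(toy_labels)

# Missing a rare-class positive costs more than missing a frequent-class one.
miss_frequent = weighted_bce([[0.1, 0.1, 0.1]], [[1, 0, 0]], toy_weights)
miss_rare = weighted_bce([[0.1, 0.1, 0.1]], [[0, 1, 0]], toy_weights)
```

With these weights, a missed positive in a rare compartment contributes more to the loss than a missed positive in a frequent one, which is the usual motivation for reweighting when compartment labels are heavily imbalanced.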