🤖 AI Summary
Subcellular structures exhibit high morphological and spatial variability, causing existing segmentation models (including SAM) to suffer from label bias and to miss fine-grained detail, which limits generalization. To address this, the authors propose ScSAM, a framework that integrates SAM's strong semantic representations with cell-structure priors learned via MAE pretraining. A feature alignment and fusion module harmonizes these complementary cues, and a cosine-similarity-based class prompt encoder explicitly models inter-class relationships to mitigate the training bias induced by data imbalance. Extensive experiments on multiple subcellular image datasets demonstrate that ScSAM consistently outperforms state-of-the-art methods, achieving significant improvements in both segmentation accuracy and cross-dataset generalization.
📝 Abstract
The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, greatly increasing the risk of biased feature learning. Existing methods often rely on a single mapping relationship, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it difficult to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM representations with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to map the pre-trained embeddings into the same feature space and efficiently combine the different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder that activates class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.
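As a rough illustration of the feature alignment and fusion idea described in the abstract (not the authors' implementation), the two pre-trained embeddings can be projected into a shared space and then combined. All specifics here are assumptions: the dimensions (256 for SAM, 768 for MAE, 128 shared), the random stand-ins for learned projection weights, and the simple concatenate-project-ReLU fusion head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned projection weights (random here, purely illustrative).
W_sam = rng.standard_normal((256, 128)) * 0.05   # SAM embedding dim -> shared dim
W_mae = rng.standard_normal((768, 128)) * 0.05   # MAE embedding dim -> shared dim
W_fuse = rng.standard_normal((256, 128)) * 0.05  # concatenated pair -> fused output

def align_and_fuse(sam_feat, mae_feat):
    """Align two pre-trained embeddings to one space, then fuse them.

    sam_feat: (N, 256) per-patch SAM embeddings
    mae_feat: (N, 768) per-patch MAE embeddings
    """
    a = sam_feat @ W_sam                              # align SAM features
    b = mae_feat @ W_mae                              # align MAE features
    fused = np.concatenate([a, b], axis=-1) @ W_fuse  # combine representations
    return np.maximum(fused, 0.0)                     # simple ReLU nonlinearity

out = align_and_fuse(rng.standard_normal((16, 256)),
                     rng.standard_normal((16, 768)))
print(out.shape)  # (16, 128)
```

In the real model the projections would be trained jointly with the decoder; the point of the sketch is only that alignment (per-source projection) and fusion (a shared head over the concatenated features) are separate, composable steps.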
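The cosine similarity matrix-based class prompt encoder can be sketched in a similarly hedged way: build a cosine similarity matrix over per-class embeddings, then use each class's similarity row to mix in related classes, yielding relation-aware class prompts. The class count, embedding size, and row-softmax mixing rule below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def cosine_similarity_matrix(class_emb):
    """class_emb: (C, D), one embedding per subcellular class."""
    unit = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    return unit @ unit.T  # (C, C), entries in [-1, 1]

def class_prompts(class_emb):
    """Mix each class embedding with related classes, weighted by similarity."""
    sim = cosine_similarity_matrix(class_emb)
    weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # row softmax
    return weights @ class_emb  # (C, D) relation-aware class prompts

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))       # 4 hypothetical classes, 8-dim embeddings
sim = cosine_similarity_matrix(emb)
prompts = class_prompts(emb)
print(np.allclose(np.diag(sim), 1.0))   # True: each class matches itself exactly
print(prompts.shape)                    # (4, 8)
```

Explicitly encoding inter-class similarity in this way lets rare classes borrow signal from related frequent ones, which is one plausible reading of how the prompt encoder counteracts data-imbalance bias.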