AI Summary
To address incomplete pseudo-mask localization and poor cross-tissue semantic consistency in weakly supervised semantic segmentation (WSSS) of histopathological images, this paper proposes a text-guided prototype learning framework. Methodologically, it is the first to integrate CONCH's morphology-aware vision-language representations with SegFormer's multi-scale spatial architecture, introducing text-conditioned prototype initialization and structured knowledge distillation to jointly optimize semantic discriminability and spatial coherence without pixel-level annotations. Technically, it adopts a frozen-backbone-plus-lightweight-adapter paradigm. Evaluated on the BCSS-WSSS dataset, the method significantly outperforms existing WSSS approaches: it yields more complete pseudo-masks, achieves superior cross-tissue semantic consistency, and maintains high computational efficiency.
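The text-conditioned prototype initialization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, feature dimensions, and temperature value are assumptions, and random tensors stand in for CONCH text/image features.

```python
import torch
import torch.nn.functional as F

def init_prototypes(text_embeddings: torch.Tensor) -> torch.nn.Parameter:
    """Initialize one prototype per tissue class from text embeddings,
    e.g. CONCH encodings of pathology descriptions (hypothetical shapes)."""
    # text_embeddings: (num_classes, dim); prototypes stay trainable
    return torch.nn.Parameter(F.normalize(text_embeddings, dim=-1))

def pseudo_mask(patch_feats: torch.Tensor, prototypes: torch.Tensor,
                tau: float = 0.07) -> torch.Tensor:
    """Assign each patch to its most similar prototype by cosine similarity."""
    # patch_feats: (H*W, dim); prototypes: (num_classes, dim)
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return (sim / tau).softmax(dim=-1).argmax(dim=-1)  # (H*W,) class ids

# toy usage with random tensors standing in for real features
text_emb = torch.randn(4, 512)        # 4 tissue classes, 512-d embeddings
protos = init_prototypes(text_emb)
feats = torch.randn(64 * 64, 512)     # a 64x64 grid of patch features
mask = pseudo_mask(feats, protos)
```

Initializing prototypes from text rather than random vectors gives each class a semantically meaningful starting point, which is what the summary credits for more complete pseudo-masks.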
Abstract
Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and fail to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo-masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on the BCSS-WSSS dataset demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation-model backbones and lightweight trainable adapters.
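The structural distillation mechanism can be sketched as an affinity-matching loss: the student's pairwise patch similarities are pulled toward those of a frozen SegFormer teacher, so local tissue structure is preserved while prototypes are learned. This is an illustrative formulation under assumed shapes; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def structural_distillation_loss(student_feats: torch.Tensor,
                                 teacher_feats: torch.Tensor) -> torch.Tensor:
    """Match the student's patch-to-patch affinity map to a frozen
    SegFormer teacher's (hypothetical loss; shapes are (N, dim))."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    aff_s = s @ s.T                 # (N, N) student cosine affinities
    aff_t = (t @ t.T).detach()      # teacher is frozen: no gradients flow
    return F.mse_loss(aff_s, aff_t)

# toy usage: feature dims may differ, affinity maps still align at (N, N)
loss = structural_distillation_loss(torch.randn(256, 512),
                                    torch.randn(256, 768))
```

Distilling affinities rather than raw features sidesteps the dimension mismatch between backbones, which fits the frozen-backbone-plus-adapter setup the abstract describes.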