AI Summary
To address key challenges in weakly supervised semantic segmentation of histopathological images (inter-class homogeneity, intra-class heterogeneity, and CAM region shrinkage), this paper proposes a text-image dual-modal prototype collaborative learning framework. Learnable textual prompts (inspired by CoOp) generate semantic prototypes, which together with visual prototypes form a dynamic dual-prototype bank. A multi-scale pyramid module mitigates the excessive feature smoothing of Vision Transformers (ViTs) and improves localization accuracy, and contrastive vision-language alignment is integrated with an improved CAM supervision strategy. Evaluated on the BCSS-WSSS benchmark, the method significantly outperforms state-of-the-art approaches. Ablation studies confirm that textual prompt diversity, contextual modeling capability, and the complementary synergy between textual and visual prototypes are critical to the segmentation gains.
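To make the CoOp-style prompt tuning concrete, the sketch below shows the core idea in toy form: a set of shared learnable context vectors is prepended to each class-name embedding, and the resulting prompt is pooled into a normalized text prototype. All names, dimensions, and the mean-pooling stand-in for the text encoder are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8   # toy embedding width (hypothetical; real CLIP text encoders use 512)
CTX_LEN = 4     # number of learnable context tokens [V1]..[Vm], as in CoOp

# Hypothetical frozen token embeddings for each tissue class name.
class_token_embeds = {
    "tumor":  rng.normal(size=(1, EMBED_DIM)),
    "stroma": rng.normal(size=(1, EMBED_DIM)),
}

# Shared learnable context vectors; in practice these are optimized by backprop
# through the frozen text encoder, here they are just random placeholders.
context = rng.normal(size=(CTX_LEN, EMBED_DIM))

def text_prototype(class_name):
    """Build a prompt [V1..Vm][CLASS] and pool it into one text prototype."""
    prompt = np.concatenate([context, class_token_embeds[class_name]], axis=0)
    proto = prompt.mean(axis=0)            # stand-in for the text encoder output
    return proto / np.linalg.norm(proto)   # L2-normalize for cosine similarity

text_protos = {name: text_prototype(name) for name in class_token_embeds}
```

Because the context vectors are shared across classes and learned end to end, the prompts can drift toward histopathology-specific wording without hand-crafted templates.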
Abstract
Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
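The dual-modal prototype bank described above can be sketched as follows: each class holds one text and one image prototype, and patch features are scored against both banks via cosine similarity, with a weight balancing semantic and appearance cues. The array shapes, the weighting scheme, and the `alpha` parameter are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 8, 3  # toy feature dimension and number of tissue classes

def l2norm(x):
    """Row-wise L2 normalization so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Dual-modal prototype bank: one text and one image prototype per class
# (random placeholders; in the framework both are learned).
text_protos = l2norm(rng.normal(size=(K, D)))
image_protos = l2norm(rng.normal(size=(K, D)))

def classify(patch_feats, alpha=0.5):
    """Score patch features against both banks; alpha trades off text vs image cues."""
    f = l2norm(patch_feats)
    scores = alpha * f @ text_protos.T + (1.0 - alpha) * f @ image_protos.T
    return scores.argmax(axis=-1)

labels = classify(rng.normal(size=(5, D)))
```

Fusing the two score maps lets text prototypes supply class semantics where visual prototypes are ambiguous (inter-class homogeneity), while image prototypes capture appearance variation within a class (intra-class heterogeneity).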