🤖 AI Summary
In weakly supervised semantic segmentation (WSSS) of medical images, class activation maps (CAMs) suffer from severe localization bias and blurred boundaries, while conditional diffusion models (CDMs) often generate saliency maps contaminated by background noise. To address these issues, this paper proposes a novel segmentation framework that integrates a frozen CDM with pixel-wise contrastive learning. Specifically, the CDM is leveraged to extract robust feature representations; a pixel embedding space is constructed by jointly incorporating external classifier gradient maps and CAMs; and a contrastive learning–based decoder is designed to enhance foreground-background discrimination. Evaluated on four segmentation tasks across two public medical image datasets, the method consistently outperforms state-of-the-art WSSS baselines, achieving significant improvements in both segmentation accuracy and boundary delineation. These results validate the effectiveness of synergistically modeling diffusion priors and contrastive learning for weakly supervised medical image segmentation.
📝 Abstract
Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the CDM can highlight the related objects in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate this problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from the CDM's external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.
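The two core ideas in the abstract — keeping only pixels where the CAM and the classifier gradient map agree, then training pixel embeddings with a contrastive objective — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the thresholds (`hi`, `lo`), the temperature `tau`, and a plain supervised InfoNCE loss are all assumptions for exposition.

```python
import numpy as np

def select_reliable_pixels(cam, grad_map, hi=0.7, lo=0.3):
    """Pseudo-label pixels only where the CAM and the external-classifier
    gradient map agree, reducing false positives/negatives.
    Returns +1 (foreground), 0 (background), -1 (ignored)."""
    labels = np.full(cam.shape, -1, dtype=int)
    labels[(cam > hi) & (grad_map > hi)] = 1   # both maps confident: foreground
    labels[(cam < lo) & (grad_map < lo)] = 0   # both maps confident: background
    return labels

def pixel_contrastive_loss(emb, labels, tau=0.1):
    """Supervised InfoNCE over the reliable pixels: embeddings of same-class
    pixels are pulled together, foreground vs. background pushed apart.
    `emb` is (H, W, D) pixel embeddings from the decoder; `labels` is (H, W)."""
    keep = labels.reshape(-1) >= 0
    z = emb.reshape(-1, emb.shape[-1])[keep]
    y = labels.reshape(-1)[keep]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity space
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    return -log_p[pos].mean()
```

In the full method this loss would supervise the pixel decoder on top of frozen CDM features; the diffusion model itself receives no gradients.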