DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

📅 2026-05-06

📝 Abstract
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods rely solely on CLIP's vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP's dense knowledge across the two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion's reliable spatial consistency to mitigate the over-smoothing issue in CLIP's attention. It introduces an Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP's self-attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module starts from the observation that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion's generative power to maintain a dynamic key-value cache model, shifting CAM generation from a patch-text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state-of-the-art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at https://github.com/zwyang6/DiCLIP.
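The two mechanisms named in the abstract, a correlation map used as a bias on self-attention and CAM generation by retrieval from a key-value cache, can be illustrated with a toy sketch. This is not the DiCLIP implementation: the function names (biased_attention, cache_cam), the additive form of the bias, the exponential affinity on cosine similarity, and the weight and sharpness parameters are all assumptions made here for illustration, and the vectors are tiny hand-written lists rather than real CLIP or diffusion features.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, correlation, weight=1.0):
    """Add an external correlation map to attention logits, then softmax per row.

    Assumption: the bias is additive on the logits. `logits` and `correlation`
    are square patch-by-patch matrices given as lists of rows.
    """
    return [
        softmax([l + weight * c for l, c in zip(logit_row, corr_row)])
        for logit_row, corr_row in zip(logits, correlation)
    ]


def cache_cam(patches, cache_keys, cache_labels, num_classes, sharpness=5.0):
    """Score every patch for every class by retrieval from a key-value cache.

    Keys are visual feature vectors and values are class indices (hypothetical
    layout). Each key votes for its class with an affinity that decays as the
    cosine similarity to the patch drops.
    """
    cam = []
    for patch in patches:
        scores = [0.0] * num_classes
        for key, label in zip(cache_keys, cache_labels):
            affinity = math.exp(-sharpness * (1.0 - cosine(patch, key)))
            scores[label] += affinity
        cam.append(scores)
    return cam


# Over-smoothed (uniform) attention becomes more selective once the bias is added.
uniform_logits = [[0.0, 0.0], [0.0, 0.0]]
correlation_map = [[2.0, 0.0], [0.0, 2.0]]
attention = biased_attention(uniform_logits, correlation_map)

# Two patches scored against a cache holding one visual key per class.
patch_features = [[1.0, 0.0], [0.0, 1.0]]
activation = cache_cam(patch_features, [[1.0, 0.0], [0.0, 1.0]], [0, 1], num_classes=2)
```

In this toy setup each attention row shifts its mass towards the patch favoured by the correlation map, and each patch receives its highest activation for the class whose cached visual key it most resembles, which is the sense in which retrieval replaces patch-text matching.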
Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Semantic Segmentation
CLIP
Class Activation Maps
Dense Knowledge
Vision-Language Pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Model
CLIP
Weakly Supervised Semantic Segmentation
Visual Correlation Enhancement
Text Semantic Augmentation
Zhiwei Yang
Guangzhou Institute of Technology, Xidian University, Guangzhou, China
Deep Learning, Computer Vision, Anomaly Detection
Pengfei Song
Shandong Computer Science Center, China
Yucong Meng
Digital Medical Research Center, School of Basic Medical Science, Fudan University; Shanghai Key Lab of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China
Kexue Fu
City University of Hong Kong
HCI, Storytelling, Creativity, Cognition, Human-AI collaboration
Shuo Wang
Fudan University
AI for Multi-Modal Medicine, Medical Image Analysis, Biomechanics
Zhijian Song
Zhongshan Hospital, Fudan University, Shanghai 200032, P.R. China; Digital Medical Research Center, School of Basic Medical Science, Fudan University; Shanghai Key Lab of Medical Image Computing and Computer Assisted Intervention, Shanghai 200032, China