🤖 AI Summary
In Vision Transformer (ViT)-based weakly supervised semantic segmentation (WSSS), conventional CAM refinement suffers from excessive background noise in deep layers and insufficient semantic activation, because attention affinities are modeled uniformly across layers. This stems from "undisciplined over-smoothing", a newly identified phenomenon in which unregulated query-key affinities cause attention to over-concentrate on certain tokens and distort the resulting CAMs.
Method: We propose the Adaptive Re-Activation Mechanism (AReAM), which uses shallow-layer attention entropy to perceive how deep-layer attention converges, enabling cross-layer affinity calibration. Through entropy-driven attention re-weighting and refined CAM reconstruction, AReAM jointly suppresses background noise and re-activates crucial semantic regions in the CAMs.
Contribution/Results: Evaluated on PASCAL VOC and COCO, AReAM achieves significant mIoU gains over state-of-the-art WSSS methods. Visualized CAMs exhibit sharper object boundaries and markedly reduced background noise, validating both quantitative and qualitative improvements.
📝 Abstract
Weakly supervised semantic segmentation (WSSS) has recently attracted considerable attention because it requires fewer annotations than fully supervised approaches, making it especially promising for large-scale image segmentation tasks. Although many vision transformer-based methods leverage self-attention affinity matrices to refine Class Activation Maps (CAMs), they often treat each layer's affinity equally and thus introduce considerable background noise at deeper layers, where attention tends to converge excessively on certain tokens (i.e., over-smoothing). We observe that this deep-level attention naturally converges on a subset of tokens, yet unregulated query-key affinity can generate unpredictable activation patterns (undisciplined over-smoothing), adversely affecting CAM accuracy. To address these limitations, we propose an Adaptive Re-Activation Mechanism (AReAM), which exploits shallow-level affinity to guide deeper-layer convergence in an entropy-aware manner, thereby suppressing background noise and re-activating crucial semantic regions in the CAMs. Experiments on two commonly used datasets demonstrate that AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions. Overall, this work underscores the importance of controlling deep-level attention to mitigate undisciplined over-smoothing, introduces an entropy-aware mechanism that harmonizes shallow and deep-level affinities, and provides a refined approach to enhance transformer-based WSSS accuracy by re-activating CAMs.
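To make the entropy-aware idea concrete, here is a minimal NumPy sketch of one plausible reading of the mechanism: shallow-layer attention entropy serves as a reference, deep-layer affinities whose entropy has collapsed (over-smoothed) are down-weighted, and the calibrated affinity is used to propagate and re-activate the CAM. All function names, the shallow/deep split, and the specific weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy of the rows of a row-stochastic affinity matrix."""
    eps = 1e-12
    return float(-(attn * np.log(attn + eps)).sum(axis=-1).mean())

def aream_refine(cam, affinities, split=6):
    """Hypothetical sketch of entropy-guided CAM re-activation (not the official AReAM code).

    cam:        (tokens, classes) raw class activation map, non-negative.
    affinities: list of (tokens, tokens) row-stochastic attention matrices, one per layer.
    split:      index separating shallow layers (reference) from deep layers (calibrated).
    """
    shallow, deep = affinities[:split], affinities[split:]
    # Shallow layers set the entropy reference: how spread-out attention "should" be.
    ref_entropy = np.mean([attention_entropy(a) for a in shallow])
    # Down-weight deep layers whose entropy has collapsed far below the reference,
    # i.e., layers exhibiting over-concentrated (over-smoothed) attention.
    weights = np.array([min(attention_entropy(a) / ref_entropy, 1.0) for a in deep])
    weights /= weights.sum()
    fused = sum(w * a for w, a in zip(weights, deep))
    # Propagate activations along the calibrated affinity to re-activate semantic regions.
    refined = fused @ cam
    return refined / (refined.max(axis=0, keepdims=True) + 1e-12)
```

Used on per-layer attention maps extracted from a ViT, this returns a re-normalized CAM whose background rows receive less mass from collapsed deep-layer attention; the real method would additionally operate per attention head and per image.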