🤖 AI Summary
This work addresses the limited spatial accuracy and causal fidelity of attention in existing vision models, which often fails to align with true discriminative regions. The authors propose CAMAL (Class Activation Map Attention Learning), which uses segmentation masks as supervision signals to regularize attention during training, encouraging attention on ground-truth discriminative regions and suppressing it elsewhere. Built upon class activation maps, CAMAL acts as an auxiliary regularizer and is evaluated under two learning paradigms, deep learning and deep reinforcement learning, improving both attention alignment and faithfulness without increasing inference cost. Experiments show statistically significant gains in attention alignment across all settings and an improvement of over 35% in attention faithfulness compared to recent work, with improved or comparable generalization performance.
📝 Abstract
Many vision datasets now provide segmentation masks in addition to annotated images to support a wide range of tasks. In this work, we propose Class Activation Map Attention Learning (CAMAL), an efficient and scalable method that utilizes segmentation masks to improve attention alignment and faithfulness in vision models. Specifically, attention alignment refers to the degree to which a model's attention aligns with ground-truth discriminative regions, while attention faithfulness refers to the degree to which a model's attention influences its decision. Improving both is essential for ensuring that model attention is spatially accurate and causally meaningful. During training, CAMAL first extracts the model's attention for each image and then compares it to ground-truth discriminative regions obtained from the corresponding segmentation masks. CAMAL then acts as an auxiliary regularizer, encouraging attention that aligns with ground-truth discriminative regions while suppressing attention elsewhere. We evaluated CAMAL across two learning paradigms -- Deep Learning (DL) and Deep Reinforcement Learning (DRL) -- and observed consistent, significant improvements in both attention alignment and faithfulness. In particular, CAMAL yields statistically significant gains in attention alignment across all settings, and improves attention faithfulness by over 35% compared to recent work. Moreover, we show that improved attention alignment and faithfulness enhance explainability, while yielding improved or comparable generalization performance without increasing inference cost. These findings demonstrate that the spatial information contained within segmentation masks can be effectively leveraged to guide model attention across learning tasks.
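The training-time procedure described in the abstract (extract attention, compare it to the mask, add an auxiliary penalty) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss, so the penalty form, the weight `lam`, and the function names `class_activation_map`, `attention_regularizer`, and `total_loss` are all assumptions made for illustration. Attention is assumed to be a class activation map, computed as a weighted sum of feature maps.

```python
from typing import List

Grid = List[List[float]]


def class_activation_map(features: List[Grid], class_weights: List[float]) -> Grid:
    """Weighted sum of feature maps, clipped at zero and scaled to [0, 1]."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(features, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak == 0.0:
        return cam  # no positive activation, nothing to normalize
    return [[v / peak for v in row] for row in cam]


def attention_regularizer(cam: Grid, mask: List[List[int]]) -> float:
    """Hypothetical penalty (an assumption, not the paper's loss):
    mean missing attention inside the mask plus mean attention outside it."""
    missing_inside, leaked_outside = 0.0, 0.0
    n_inside, n_outside = 0, 0
    for cam_row, mask_row in zip(cam, mask):
        for attention, in_mask in zip(cam_row, mask_row):
            if in_mask:
                missing_inside += 1.0 - attention
                n_inside += 1
            else:
                leaked_outside += attention
                n_outside += 1
    return missing_inside / max(n_inside, 1) + leaked_outside / max(n_outside, 1)


def total_loss(task_loss: float, cam: Grid, mask: List[List[int]], lam: float = 0.1) -> float:
    """Task loss plus the auxiliary attention term; lam is an assumed weight."""
    return task_loss + lam * attention_regularizer(cam, mask)


# Toy 2x2 example with two feature maps.
features = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(features, class_weights=[1.0, 0.5])
aligned_mask = [[1, 0], [0, 1]]
misaligned_mask = [[0, 1], [1, 0]]
print(attention_regularizer(cam, aligned_mask))
print(attention_regularizer(cam, misaligned_mask))
```

Because the penalty is only added to the training loss, the forward pass at inference time is unchanged, which is consistent with the abstract's claim of no added inference cost.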