🤖 AI Summary
This work addresses concept leakage, the overlap of attention activations on visually confusable concepts, in Multi-Modal Diffusion Transformers (MM-DiTs). The authors propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. It selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph combines output-space similarity, which drives dense propagation within an object, with a row-wise attention gate that suppresses cross-object connections. The study also introduces the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, so that concept leakage can be evaluated explicitly. Experiments show strong grounding performance on ImageNet-Segmentation and PascalVOC and substantially reduced concept leakage on the new dataset.
📝 Abstract
Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
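The pipeline described above (pick an anchor, seed it as a one-hot vector, propagate over a gated similarity graph) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration and not the authors' implementation: the function name `anchor_propagate`, cosine similarity as the output-space metric, a gate that keeps edges above a fraction `gate_ratio` of each row's maximum self-attention, and a restart-style propagation controlled by `alpha` and `steps`.

```python
import numpy as np


def anchor_propagate(concept_attn, self_attn, outputs,
                     gate_ratio=0.5, alpha=0.8, steps=20):
    """Illustrative sketch only; all formulas and parameters are assumptions.

    concept_attn: (N,)   concept-to-image attention over N image tokens
    self_attn:    (N, N) image-to-image self-attention, one row per query token
    outputs:      (N, D) per-token output features
    """
    n = concept_attn.shape[0]

    # 1) Semantic localization: the single highest-scoring token is the anchor.
    anchor = int(np.argmax(concept_attn))
    seed = np.zeros(n)
    seed[anchor] = 1.0  # one-hot seed

    # 2) Output-space similarity (assumed: non-negative cosine similarity).
    unit = outputs / (np.linalg.norm(outputs, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(unit @ unit.T, 0.0, None)

    # 3) Row-wise attention gate (assumed: keep an edge only if its
    #    self-attention is at least gate_ratio times the row maximum).
    row_max = self_attn.max(axis=1, keepdims=True)
    gate = self_attn >= gate_ratio * row_max

    # 4) Hybrid graph: gated similarity, row-normalized.
    weights = sim * gate
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-8)

    # 5) Structural refinement: propagate the seed with restart at the anchor.
    score = seed.copy()
    for _ in range(steps):
        score = alpha * (weights @ score) + (1.0 - alpha) * seed

    return anchor, score / (score.max() + 1e-8)


# Toy example: tokens 0-2 belong to object A, tokens 3-5 to a look-alike B.
feat_a, feat_b = [1.0, 0.1], [0.9, 0.3]  # nearly parallel feature vectors
outputs = np.array([feat_a] * 3 + [feat_b] * 3)
self_attn = np.full((6, 6), 0.1 / 3)     # weak cross-object attention
self_attn[:3, :3] = 0.3                  # strong attention within A
self_attn[3:, 3:] = 0.3                  # strong attention within B
concept_attn = np.array([0.20, 0.30, 0.20, 0.15, 0.10, 0.05])  # leaks onto B

anchor, score = anchor_propagate(concept_attn, self_attn, outputs)
```

In this toy setup the raw concept attention assigns noticeable mass to object B, and the two objects are almost indistinguishable by feature similarity alone. Because the gate removes every cross-object edge, the seed placed on object A never reaches B, which is the behavior the abstract attributes to the row-wise gate.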