🤖 AI Summary
Text-to-image (T2I) diffusion models suffer from semantic leakage: unintended cross-entity semantic associations arising from excessive attention-based interactions among distinct entities. To address this, we propose DeLeaker, a lightweight, training-free inference-time intervention that dynamically reweights attention maps during the denoising process to suppress inter-entity semantic leakage; to our knowledge, this is the first work to directly modulate attention mechanisms at inference time for mitigating semantic leakage. We also introduce SLIM, the first dedicated benchmark dataset for semantic leakage evaluation, along with an automated assessment framework. Extensive experiments demonstrate that DeLeaker significantly outperforms existing baselines across diverse scenarios, effectively reducing semantic leakage while preserving image quality and fidelity, without requiring additional inputs or model fine-tuning.
📝 Abstract
Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
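The core idea, reweighting attention maps to suppress cross-entity interactions while strengthening each entity's identity, can be sketched with a toy example. The sketch below is an illustration of the general technique, not the paper's actual algorithm: the `suppress`/`boost` multipliers, the `entity_ids` labeling, and the renormalization step are all assumptions made for this example.

```python
import numpy as np

def reweight_attention(attn, entity_ids, suppress=0.5, boost=1.5):
    """Reweight a row-stochastic attention map to damp cross-entity leakage.

    attn:       (n, n) attention matrix whose rows sum to 1
    entity_ids: length-n array; tokens sharing an id belong to the same
                entity (id -1 marks background tokens, left untouched)
    suppress:   multiplier for attention between different entities
                (hypothetical parameter, not from the paper)
    boost:      multiplier for attention within the same entity
                (hypothetical parameter, not from the paper)
    """
    ids = np.asarray(entity_ids)
    tagged = ids[:, None] >= 0
    same = (ids[:, None] == ids[None, :]) & tagged
    cross = (ids[:, None] != ids[None, :]) & tagged & (ids[None, :] >= 0)
    # Boost within-entity weights, suppress cross-entity weights,
    # leave background interactions unchanged.
    w = np.where(same, boost, np.where(cross, suppress, 1.0))
    out = attn * w
    # Renormalize so each row remains a valid attention distribution.
    return out / out.sum(axis=-1, keepdims=True)
```

Because rows are renormalized after scaling, the fraction of each token's attention mass spent on other entities strictly decreases whenever any same-entity mass exists, which is the leakage-suppression effect the abstract describes.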