🤖 AI Summary
Text-to-image (T2I) diffusion models struggle to suppress content that is strongly entangled with specific concepts (e.g., a prompt for “Charlie Chaplin” almost always yields a “mustache”, even when the prompt asks for none). To address this, we propose a delta vector that shifts the text embedding to weaken the influence of the undesired content; this vector can be obtained zero-shot, without fine-tuning the model. Building on it, Selective Suppression with Delta Vector (SSDV) integrates the delta vector into the cross-attention mechanism, so that suppression is concentrated in the regions where the unwanted content would otherwise be generated. For personalized T2I models, the delta vector is additionally optimized, which enables more precise suppression than previous baselines could achieve. Experiments show that the approach outperforms existing methods in both quantitative and qualitative evaluations.
📝 Abstract
Text-to-Image (T2I) diffusion models have made significant progress in generating diverse, high-quality images from textual prompts. However, these models still struggle to suppress content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even when the prompt explicitly asks to exclude it, because the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach that suppresses such entangled content directly within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of the undesired content in the generated image, and we further demonstrate that this delta vector can be obtained easily in a zero-shot manner. Furthermore, we propose Selective Suppression with Delta Vector (SSDV), which integrates the delta vector into the cross-attention mechanism and enables more effective suppression of unwanted content in the regions where it would otherwise be generated. Additionally, by optimizing the delta vector, we enable more precise suppression in personalized T2I models, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods in both quantitative and qualitative evaluations.
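As a rough illustration of the delta-vector idea described in the abstract, the toy sketch below shifts an embedding against a direction that stands for the unwanted concept. It is not the authors' implementation. The abstract does not say how the delta vector is obtained zero-shot, so computing it as the difference between two prompt embeddings is an assumption made here, and the 4-dimensional vectors, the `alpha` strength parameter, and all function names are hypothetical.

```python
import math


def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def apply_delta(embedding, delta, alpha=1.0):
    """Shift an embedding against the delta direction.

    alpha is a hypothetical strength parameter, not taken from the paper.
    """
    return [e - alpha * d for e, d in zip(embedding, delta)]


def projection(embedding, direction):
    """Scalar projection of an embedding onto a direction vector."""
    return dot(embedding, direction) / math.sqrt(dot(direction, direction))


# Toy stand-ins for text-encoder outputs (made-up values, not real embeddings).
emb_with = [0.9, 0.1, 0.8, 0.3]     # e.g. "a man with a mustache"
emb_without = [0.9, 0.1, 0.1, 0.3]  # e.g. "a man"

# ASSUMPTION: the delta vector is taken as the difference of the two prompts.
delta = subtract(emb_with, emb_without)

emb_target = [0.5, 0.7, 0.6, 0.2]   # e.g. "Charlie Chaplin"
emb_suppressed = apply_delta(emb_target, delta, alpha=1.0)

# How strongly each embedding points along the unwanted-concept direction.
before = projection(emb_target, delta)
after = projection(emb_suppressed, delta)
print(before, after)
```

In this toy setting the component of the target embedding along the delta direction drops after the shift, which is the intended effect of weakening the undesired content. The selective, cross-attention-level variant (SSDV) and the optimized delta vector for personalized models are not covered by this sketch.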