🤖 AI Summary
Existing diffusion models rely on heuristic, theoretically unjustified guidance mechanisms such as Classifier-Free Guidance (CFG), which do not reliably improve generation quality. To address this, we propose Adversarial Sinkhorn Attention Guidance (ASAG): a theoretically grounded framework that reformulates self-attention as a Sinkhorn optimization problem with an adversarial cost function, derived from optimal transport theory. ASAG explicitly attenuates pixel-level query-key similarity, enabling interpretable and principled control over attention alignment. Crucially, it requires no model retraining and is plug-and-play. This work is the first to integrate optimal transport with adversarial attention guidance. Extensive experiments demonstrate significant improvements in fidelity and controllability across diverse tasks, including text-to-image generation, IP-Adapter, and ControlNet, while maintaining computational efficiency and strong generalization.
📝 Abstract
Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
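To make the core idea concrete, here is a minimal sketch of attention viewed as an entropic optimal transport problem solved with log-domain Sinkhorn iterations, where an adversarial term attenuates query-key similarity. The function name, the `adv_scale` parameter, and the specific form of the adversarial cost are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def adversarial_sinkhorn_attention(Q, K, V, n_iters=20, eps=0.05, adv_scale=0.0):
    """Sketch of attention as an entropic OT plan (assumed form, not ASAG's exact cost).

    The transport cost is the negative scaled query-key similarity;
    adv_scale in [0, 1) adversarially attenuates that similarity
    (adv_scale=0 recovers plain Sinkhorn-normalized attention).
    """
    d = Q.shape[-1]
    sim = Q @ K.T / np.sqrt(d)             # pixel-wise query-key similarity
    cost = -(1.0 - adv_scale) * sim        # hypothetical adversarial attenuation
    log_P = -cost / eps                    # unnormalized log transport plan
    # Log-domain Sinkhorn: alternate row/column normalization toward
    # a doubly stochastic plan.
    for _ in range(n_iters):
        log_P = log_P - _logsumexp(log_P, axis=1)
        log_P = log_P - _logsumexp(log_P, axis=0)
    P = np.exp(log_P)
    # Renormalize rows so each query's weights sum to 1 before mixing values,
    # matching the usual softmax-attention convention.
    P = P / P.sum(axis=1, keepdims=True)
    return P @ V
```

In a guidance setting, the degraded output from a positive `adv_scale` would play the role of the "bad" branch that CFG-style extrapolation steers away from; the sketch above only illustrates the attention-side computation.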