🤖 AI Summary
Existing diffusion models rely on heuristic, theoretically unjustified guidance mechanisms such as Classifier-Free Guidance (CFG), which do not reliably improve generation quality. To address this, we propose Adversarial Sinkhorn Attention Guidance (ASAG): a theoretically grounded framework that reformulates self-attention as a Sinkhorn optimization problem with an adversarial cost function, derived from optimal transport theory. ASAG explicitly attenuates pixel-level query-key similarity, enabling interpretable and principled control over attention alignment. Crucially, it requires no model retraining and is plug-and-play. This work is the first to integrate optimal transport with adversarial attention guidance. Extensive experiments demonstrate significant improvements in fidelity and controllability across diverse tasks, including text-to-image generation, IP-Adapter, and ControlNet, while maintaining computational efficiency and strong generalization.
📝 Abstract
Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
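To make the core idea concrete, here is a minimal sketch of attention viewed as an entropic optimal transport problem solved with log-domain Sinkhorn iterations, where an adversarial term attenuates query-key similarity. The function name, the `adv_scale` parameter, and the specific form of the adversarial cost are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def adversarial_sinkhorn_attention(Q, K, V, n_iters=20, eps=0.05, adv_scale=0.0):
    """Sketch of attention as an entropic OT plan (assumed form, not ASAG's exact cost).

    The transport cost is the negative scaled query-key similarity;
    adv_scale in [0, 1) adversarially attenuates that similarity
    (adv_scale=0 recovers plain Sinkhorn-normalized attention).
    """
    d = Q.shape[-1]
    sim = Q @ K.T / np.sqrt(d)             # pixel-wise query-key similarity
    cost = -(1.0 - adv_scale) * sim        # hypothetical adversarial attenuation
    log_P = -cost / eps                    # unnormalized log transport plan
    # Log-domain Sinkhorn: alternate row/column normalization toward
    # a doubly stochastic plan.
    for _ in range(n_iters):
        log_P = log_P - _logsumexp(log_P, axis=1)
        log_P = log_P - _logsumexp(log_P, axis=0)
    P = np.exp(log_P)
    # Renormalize rows so each query's weights sum to 1 before mixing values,
    # matching the usual softmax-attention convention.
    P = P / P.sum(axis=1, keepdims=True)
    return P @ V
```

In a guidance setting, the degraded output from a positive `adv_scale` would play the role of the "bad" branch that CFG-style extrapolation steers away from; the sketch above only illustrates the attention-side computation.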