Guiding a Diffusion Model by Swapping Its Tokens

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Classifier-Free Guidance cannot be applied to unconditional generation, and conventional perturbation-based methods lack fine-grained control. This work proposes Self-Swap Guidance, a novel approach that generates perturbed predictions during diffusion model inference by swapping the token latents with the largest semantic discrepancy and uses the directional difference between the perturbed and original predictions to guide sampling. This method establishes, for the first time, a fine-grained, plug-and-play guidance mechanism applicable to both conditional and unconditional generation. Evaluated on MS-COCO and ImageNet, it significantly improves image fidelity, prompt alignment, and robustness to guidance strength while effectively mitigating undesirable side effects.
📝 Abstract
Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
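The mechanism described in the abstract can be illustrated with a short sketch: build a perturbed prediction by swapping the pair of token latents with the lowest cosine similarity (the "most semantically dissimilar" pair), then push the sampling direction from the perturbed prediction toward the clean one, CFG-style. This is a minimal illustration assuming a flat `(N, D)` token layout and a single spatial swap per step; the function names, the cosine-similarity criterion as implemented here, and the guidance formula `clean + scale * (clean - perturbed)` are our reading of the abstract, not the authors' released code.

```python
import numpy as np

def swap_most_dissimilar_tokens(tokens):
    """Return a copy of `tokens` (N, D) with the least-similar pair swapped.

    Similarity is measured by cosine similarity between token latents;
    the pair with the lowest value is exchanged (spatial-dimension swap).
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, np.inf)           # exclude self-pairs
    i, j = np.unravel_index(np.argmin(sim), sim.shape)
    perturbed = tokens.copy()
    perturbed[[i, j]] = perturbed[[j, i]]   # swap the two token latents
    return perturbed

def ssg_guide(eps_clean, eps_perturbed, scale=2.0):
    """CFG-like update: steer along the perturbed-to-clean direction."""
    return eps_clean + scale * (eps_clean - eps_perturbed)
```

At each denoising step, the model would be run once on the original latents and once on the swapped latents, and `ssg_guide` would combine the two noise predictions; `scale` plays the role of the guidance strength. Note that when the perturbation has no effect, the guided prediction reduces to the clean one.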
Problem

Research questions and friction points this paper is trying to address.

diffusion models · Classifier-Free Guidance · unconditional generation · image fidelity · token swapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Swap Guidance · diffusion models · token swapping · Classifier-Free Guidance · fine-grained perturbation
Weijia Zhang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yuehao Liu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Shanyan Guan
vivo Mobile Communication Co., Ltd.
Wu Ran
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yanhao Ge
vivo Mobile Communication Co., Ltd.
Wei Li
vivo Mobile Communication Co., Ltd.
Chao Ma
Professor, Shanghai Jiao Tong University
Computer vision · Machine learning · Image processing