Guiding a Diffusion Model by Swapping Its Tokens

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Classifier-Free Guidance cannot be applied to unconditional generation, and conventional perturbation-based methods lack fine-grained control. This work proposes Self-Swap Guidance, a novel approach that generates perturbed predictions during diffusion model inference by swapping the token latents with the largest semantic discrepancy and uses the directional difference between the perturbed and original predictions to guide sampling. This method establishes, for the first time, a fine-grained, plug-and-play guidance mechanism applicable to both conditional and unconditional generation. Evaluated on MS-COCO and ImageNet, it significantly improves image fidelity, prompt alignment, and robustness to guidance strength while effectively mitigating undesirable side effects.
📝 Abstract
Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
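The mechanism described in the abstract can be illustrated with a short sketch: build a perturbed prediction by swapping the pair of token latents with the lowest cosine similarity (the "most semantically dissimilar" pair), then push the sampling direction from the perturbed prediction toward the clean one, CFG-style. This is a minimal illustration assuming a flat `(N, D)` token layout and a single spatial swap per step; the function names, the cosine-similarity criterion as implemented here, and the guidance formula `clean + scale * (clean - perturbed)` are our reading of the abstract, not the authors' released code.

```python
import numpy as np

def swap_most_dissimilar_tokens(tokens):
    """Return a copy of `tokens` (N, D) with the least-similar pair swapped.

    Similarity is measured by cosine similarity between token latents;
    the pair with the lowest value is exchanged (spatial-dimension swap).
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, np.inf)           # exclude self-pairs
    i, j = np.unravel_index(np.argmin(sim), sim.shape)
    perturbed = tokens.copy()
    perturbed[[i, j]] = perturbed[[j, i]]   # swap the two token latents
    return perturbed

def ssg_guide(eps_clean, eps_perturbed, scale=2.0):
    """CFG-like update: steer along the perturbed-to-clean direction."""
    return eps_clean + scale * (eps_clean - eps_perturbed)
```

At each denoising step, the model would be run once on the original latents and once on the swapped latents, and `ssg_guide` would combine the two noise predictions; `scale` plays the role of the guidance strength. Note that when the perturbation has no effect, the guided prediction reduces to the clean one.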
Problem

Research questions and friction points this paper is trying to address.

diffusion models · Classifier-Free Guidance · unconditional generation · image fidelity · token swapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Swap Guidance · diffusion models · token swapping · Classifier-Free Guidance · fine-grained perturbation
Weijia Zhang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yuehao Liu
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Shanyan Guan
vivo Mobile Communication Co., Ltd.
Wu Ran
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Yanhao Ge
vivo Mobile Communication Co., Ltd.
Wei Li
vivo Mobile Communication Co., Ltd.
Chao Ma
Professor, Shanghai Jiao Tong University
Computer vision · Machine learning · Image processing