VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of fixed-scale classifier-free guidance (CFG) in flow-based generative models: a constant scale ignores both the time-varying strength of the semantic signal and whether the guided velocity agrees with the model's current dynamics, which can weaken structural fidelity and semantic alignment in image synthesis and editing. The authors propose Velocity-Adaptive Guidance Scale (VAGS), which adjusts the guidance strength at each sampling step by multiplying the nominal scale by a bounded factor that combines a temporal signal-level term with the cosine similarity between task-relevant velocity fields. VAGS requires no additional training, fine-tuning, auxiliary networks, or extra forward passes. On PIE-Bench and DIV2K for editing, and on COCO17, CUB-200, and Flickr30K for generation, the authors report that VAGS consistently improves structural fidelity and generation quality over fixed CFG and other training-free guidance baselines.

📝 Abstract

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
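
The abstract describes the mechanism only in words, so the following is a minimal sketch of the generation-side idea under stated assumptions, not the authors' implementation. The function names, the linear `progress` term, the `alpha` weight, and the clipping bounds `lo` and `hi` are all hypothetical; the paper's exact factor may differ. The sketch shows a nominal scale multiplied by a bounded factor built from a temporal term and the cosine similarity of two velocities that standard CFG already computes, so no extra forward pass is involved.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two flattened velocity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def adaptive_scale(base_scale, progress, v_a, v_b, alpha=0.5, lo=0.5, hi=1.5):
    """Hypothetical adaptive scale: base_scale * clip(1 + alpha * progress * cos, lo, hi).

    progress: 0.0 at the first (noise-dominated) step, 1.0 at the last step.
              This linear temporal term is an assumption made for illustration.
    v_a, v_b: the two velocities being compared (unconditional vs. conditional
              for generation, source- vs. target-guided for editing).
    alpha=0 gives a factor of 1, i.e. plain fixed-scale CFG.
    """
    cos = cosine_similarity(v_a, v_b)
    factor = 1.0 + alpha * progress * cos
    factor = min(max(factor, lo), hi)  # keep the multiplier bounded
    return base_scale * factor


def guided_velocity(v_uncond, v_cond, scale):
    """Standard CFG combination: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


if __name__ == "__main__":
    v_uncond = [1.0, 0.0]
    v_cond = [2.0, 0.0]  # same direction as v_uncond, so cosine similarity is 1
    scale = adaptive_scale(7.5, progress=1.0, v_a=v_uncond, v_b=v_cond)
    print(scale, guided_velocity(v_uncond, v_cond, scale))
```

In this sketch, aligned velocities late in sampling raise the effective scale, opposed velocities lower it, and the first step leaves it at the nominal value. Setting `alpha=0` returns the nominal scale at every step, mirroring the abstract's statement that fixed CFG is a special case.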

Problem

Research questions and friction points this paper is trying to address.

Classifier-free guidance

Image editing

Image generation

Guidance scale

Flow-based sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Velocity-Adaptive Guidance

Classifier-Free Guidance

Flow-Based Generation