🤖 AI Summary
This work addresses the diffuse, detail-poor samples that pretrained flow models produce when sampled without guidance, a problem attributed to the smoothing effect of neural networks. Existing guidance methods such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. The work proposes Momentum Guidance (MG), a plug-and-play technique that extrapolates the current velocity using an exponential moving average of past velocities along the ODE trajectory, keeping the standard cost of one network evaluation per step. MG can be used on its own or combined with CFG. On ImageNet-256 it improves FID by an average of 36.68% without CFG and 25.52% with CFG across sampling settings, reaching an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models such as Stable Diffusion 3 and FLUX.1-dev show consistent quality improvements across standard metrics.
📝 Abstract
Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG's effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models like Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
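The abstract describes the mechanism only in words, so the following is a minimal sketch of how an EMA-based velocity extrapolation could sit inside an Euler ODE sampler. The update rule `v_hat = v + weight * (v - ema)`, the function name `sample_with_momentum_guidance`, and the hyperparameters `weight` and `beta` are illustrative assumptions, not the paper's stated formulation. What the sketch does reflect from the abstract is that each step uses exactly one velocity evaluation.

```python
def sample_with_momentum_guidance(velocity_fn, x0, num_steps=64, weight=0.5, beta=0.9):
    """Euler sampler with a momentum-style velocity extrapolation.

    Assumptions (not taken from the paper): the extrapolation rule,
    the EMA update, and the names `weight` and `beta`.
    Works with plain floats or NumPy-style arrays.
    """
    x = x0
    dt = 1.0 / num_steps
    ema = None  # running average of past velocities
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)  # the only model evaluation in this step
        if ema is None:
            v_hat = v          # no history yet: take a plain Euler step
            ema = v
        else:
            # Push the current velocity away from its historical average.
            v_hat = v + weight * (v - ema)
            # Update the average with the raw (unextrapolated) velocity.
            ema = beta * ema + (1.0 - beta) * v
        x = x + dt * v_hat
    return x


# Toy usage: a hypothetical linear velocity field pulling x toward 1.0.
sample = sample_with_momentum_guidance(lambda x, t: 1.0 - x, x0=0.0, num_steps=8)
```

With `weight=0.0` the loop reduces to ordinary Euler integration, which makes it easy to compare against an unguided baseline under the same step budget.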