🤖 AI Summary
Autoregressive image generation via diffusion sampling suffers from high inference latency, since sampling each token typically requires 50–100 denoising steps. Method: This paper identifies, for the first time, that token distributions concentrate and denoising trajectories become increasingly linear at later generation stages, as evidenced by MLP-based prediction, variance estimation, and monitoring of denoising trajectories. Leveraging this insight, we propose DiSA, a training-free diffusion step annealing mechanism that gradually reduces the number of denoising steps as more tokens are generated. DiSA is orthogonal to existing diffusion acceleration techniques. Results: DiSA achieves 5–10× speedup on MAR and Harmon, and 1.4–2.5× on FlowAR and xAR, with no degradation in generation quality; it requires only a few lines of code to integrate. Our core contribution is the discovery of how denoising paths evolve during autoregressive generation and the design of the first lightweight, training-free, plug-and-play diffusion step adaptation strategy.
📝 Abstract
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To explain intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and are thus more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and, albeit simple, achieves $5$-$10\times$ faster inference for MAR and Harmon and $1.4$-$2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
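The annealing idea above can be sketched in a few lines. The abstract does not specify the exact schedule shape, so the linear interpolation below (from 50 steps down to 5 across the token sequence) and the function/parameter names (`anneal_steps`, `start_steps`, `end_steps`) are illustrative assumptions, not the paper's actual implementation:

```python
def anneal_steps(token_idx: int, total_tokens: int,
                 start_steps: int = 50, end_steps: int = 5) -> int:
    """Return the number of diffusion denoising steps for the current token.

    Illustrative linear annealing: early tokens (unconstrained distributions)
    get the full step budget; later tokens (more constrained, near-linear
    denoising paths) get progressively fewer steps.
    """
    # Fraction of the autoregressive sequence generated so far, in [0, 1].
    frac = token_idx / max(total_tokens - 1, 1)
    return round(start_steps + (end_steps - start_steps) * frac)

# Example: a 256-token generation, as in typical MAR-style image generation.
schedule = [anneal_steps(i, 256) for i in range(256)]
```

Integrating such a schedule into an existing sampler would amount to replacing the fixed step count with `anneal_steps(token_idx, total_tokens)` at each token, which is consistent with the paper's claim that DiSA needs only a few lines of code.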