🤖 AI Summary
This work addresses the inefficiency of autoregressive image generation, which suffers from sequential dependencies and ambiguity in image tokens, leading to slow inference. While existing speculative decoding methods struggle to balance speed and generation quality, this paper introduces COOL-SD, the first approach to establish a theoretical foundation for relaxed speculative decoding. By analyzing total variation distance and perturbation behavior, the authors derive an optimal resampling distribution and incorporate an annealing mechanism to dynamically adjust the degree of relaxation during decoding. The proposed method achieves significant acceleration without compromising visual fidelity, consistently outperforming current state-of-the-art techniques across multiple benchmarks and offering a superior trade-off between inference speed and generation quality.
📝 Abstract
Despite significant progress in autoregressive (AR) image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when speculative decoding (SD) is used. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model's distribution and that of relaxed SD, yielding an optimal resampling distribution that minimizes an upper bound on this distance. The second uses perturbation analysis to reveal an annealing behavior in relaxed SD, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster at comparable quality, or to achieve higher quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in the speed-quality trade-off.
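To make the mechanism concrete, here is a minimal toy sketch of one relaxed speculative-decoding step over a small vocabulary. The `lenience` parameter, the loosened acceptance rule, and the residual resampling distribution are all illustrative assumptions for exposition; they are not COOL-SD's derived optimal resampling distribution or its exact annealing schedule.

```python
import numpy as np

def relaxed_spec_step(p, q, rng, lenience=1.0):
    """One relaxed speculative-decoding step over a toy vocabulary.

    p: target-model distribution, q: draft-model distribution
    (1-D arrays summing to 1). lenience >= 1 loosens the acceptance
    test; lenience == 1 recovers standard speculative decoding.
    This acceptance rule is an illustrative assumption, not the
    paper's derived optimum.
    """
    x = rng.choice(len(q), p=q)  # draft model proposes a token
    accept_prob = min(1.0, lenience * p[x] / q[x])
    if rng.random() < accept_prob:
        return x, True
    # On rejection, resample from the standard residual distribution
    # max(p - q, 0), renormalized (a common baseline choice; COOL-SD
    # instead derives an optimal resampling distribution).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Anneal lenience from loose to strict across decoding positions,
# mimicking the annealed-relaxation idea at a toy scale.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.4, 0.3])
for lam in np.linspace(2.0, 1.0, 5):
    token, accepted = relaxed_spec_step(p, q, rng, lenience=lam)
```

Early positions (large `lenience`) accept draft tokens more readily, trading fidelity for speed; later positions tighten toward the exact speculative-decoding test.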