🤖 AI Summary
Visual autoregressive (AR) image generation models suffer from slow inference, and speculative decoding, whose application to visual AR models has remained largely unexplored, performs poorly on them because they frequently assign uniformly low probabilities to candidate tokens (token selection ambiguity). To address this, we propose LANTERN, a relaxed speculative decoding framework with two components: (1) a relaxed acceptance condition that exploits the interchangeability of tokens in latent space, so that candidate tokens are not prematurely rejected, and (2) a total variation (TV) distance bound that keeps the speed gains from significantly compromising image quality or semantic coherence. Evaluated on LlamaGen, LANTERN increases the speed-up of a naïve application of state-of-the-art speculative decoding by 1.75× and 1.82×, as compared to greedy decoding and random sampling, respectively. The framework thereby makes speculative decoding a practical way to accelerate visual AR generation without a significant loss in fidelity.
📝 Abstract
Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature: they process tokens one at a time, which slows down generation compared to models such as GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application to visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term \textit{token selection ambiguity}, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition, referred to as LANTERN, that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naïve application of state-of-the-art speculative decoding, LANTERN increases speed-ups by $\mathbf{1.75}\times$ and $\mathbf{1.82}\times$, as compared to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model.
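To make the idea of a relaxed acceptance condition concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact rule, so the mass-shifting scheme, the function names, the `neighbors` list (assumed to come from nearest codebook entries in latent space), and the budget `delta` are all hypothetical. The sketch starts from the standard speculative decoding test, which accepts a drafted token x with probability min(1, q(x)/p(x)), and then lets the drafted token borrow target probability from its latent-space neighbours while keeping the total variation distance to the original target distribution at or below `delta`.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def accept_probability(p_draft, q_target, token):
    """Standard speculative decoding acceptance probability: min(1, q(x) / p(x))."""
    if p_draft[token] == 0.0:
        return 1.0  # cannot occur when the token was actually sampled from p_draft
    return min(1.0, q_target[token] / p_draft[token])


def relaxed_target(q_target, token, neighbors, delta):
    """Shift target mass from neighbouring tokens onto `token` (illustrative assumption).

    `neighbors` lists token indices nearest-first. Moving a total mass m changes
    the distribution by exactly m in TV distance, so capping m at `delta`
    keeps TV(q_target, relaxed) <= delta.
    """
    relaxed = list(q_target)
    budget = delta
    for n in neighbors:
        if n == token or budget <= 0.0:
            continue
        moved = min(relaxed[n], budget)
        relaxed[n] -= moved
        relaxed[token] += moved
        budget -= moved
    return relaxed


def relaxed_accept_probability(p_draft, q_target, token, neighbors, delta):
    """Acceptance probability computed against the relaxed target distribution."""
    relaxed = relaxed_target(q_target, token, neighbors, delta)
    return accept_probability(p_draft, relaxed, token)


if __name__ == "__main__":
    # Toy 4-token vocabulary: the target is uniform (ambiguous), the draft is confident.
    q = [0.25, 0.25, 0.25, 0.25]
    p = [0.70, 0.10, 0.10, 0.10]
    print("standard:", accept_probability(p, q, token=0))
    print("relaxed: ", relaxed_accept_probability(p, q, 0, neighbors=[1, 2], delta=0.2))
```

In this toy case the target model spreads its mass evenly, so the standard test rejects the drafted token most of the time. Letting the token absorb up to `delta` of its neighbours' mass raises the acceptance probability, and `delta` is the knob that trades speed against deviation from the target distribution.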