🤖 AI Summary
Autoregressive (AR) visual generation suffers from prohibitively slow inference due to sequential token-by-token decoding. Existing Speculative Jacobi Decoding (SJD) samples draft tokens independently at each iteration, causing inter-iteration inconsistency and sharply lowering the acceptance rate. This paper proposes MC-SJD, a training-free, lossless parallel decoding framework: it redesigns draft generation using maximal coupling, an information-theoretic technique that maximizes the probability of sampling identical draft tokens across consecutive iterations while preserving the model's original output distribution. The change requires only a single-line code modification yet substantially improves acceptance rates, achieving a 4.2× speedup on image generation and a 13.3× speedup on video generation, both without compromising generation quality. The core innovation is the first integration of coupled sampling into the SJD framework, balancing inference efficiency, generation fidelity, and deployment simplicity.
📝 Abstract
While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
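The coupling idea at the heart of the method can be illustrated in isolation. The sketch below (an assumption-laden illustration, not the paper's implementation) shows maximal coupling of two categorical distributions: given the previous iteration's draft distribution `p` and the current one `q`, it draws a pair `(x, y)` with correct marginals `x ~ p`, `y ~ q` while making `P(x == y)` equal to the optimal value `Σ_i min(p_i, q_i)`. Reusing the coincident token is what keeps drafts stable across Jacobi iterations; the function name and setup are hypothetical.

```python
import numpy as np

def maximal_coupling_sample(p, q, rng):
    """Draw (x, y) with x ~ p and y ~ q such that P(x == y) equals
    the maximal-coupling bound sum_i min(p_i, q_i).

    Hypothetical sketch: in an SJD-style loop, p would be the draft
    distribution from the previous iteration and q the current one;
    when x == y, the previously drafted token can be kept.
    """
    x = rng.choice(len(p), p=p)
    # Keep x as the sample from q with probability min(1, q[x] / p[x]);
    # this event has total probability sum_i min(p_i, q_i).
    if rng.random() <= min(1.0, q[x] / p[x]):
        return x, x  # tokens coincide: the draft survives the iteration
    # Otherwise resample y from the normalized residual (q - p)^+,
    # which restores the exact marginal y ~ q (lossless).
    residual = np.maximum(q - p, 0.0)
    residual /= residual.sum()
    y = rng.choice(len(q), p=residual)
    return x, y
```

A quick check of the guarantee: for `p = [0.6, 0.3, 0.1]` and `q = [0.2, 0.5, 0.3]`, the coincidence probability is `0.2 + 0.3 + 0.1 = 0.6`, and the second coordinate remains distributed as `q`, so acceleration comes at no cost in the sampled distribution.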