🤖 AI Summary
Autoregressive (AR) visual generation suffers from low inference efficiency because it relies on sequential, token-by-token decoding, which requires one forward pass per token. Method: This paper introduces ZipAR, a training-free, plug-and-play parallel decoding framework that exploits spatial locality among image tokens. ZipAR proposes a novel "next-set prediction" paradigm: while preserving row-major order for foundational token prediction, it concurrently predicts multiple spatially adjacent tokens along the column dimension, extending conventional serial decoding into multi-token parallel generation. Crucially, ZipAR is purely an inference-time scheduling optimization: it requires no model architecture modifications, no additional parameters, and no fine-tuning or retraining. Results: Evaluated on Emu3-Gen, ZipAR reduces the forward pass count by up to 91%, substantially improving throughput while largely preserving generation quality, enabling efficient AR image synthesis with zero training overhead.
📝 Abstract
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling the "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining. Code is available here: https://github.com/ThisisBillhe/ZipAR.
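To make the speedup arithmetic concrete, the following is a minimal sketch of a wavefront-style schedule consistent with the abstract's description. It is not the authors' implementation: the `window` parameter (how many tokens ahead row `i` must be before row `i+1` starts decoding, standing in for the spatial-locality assumption) and the function name are illustrative assumptions, and the sketch only counts forward passes rather than running a model.

```python
def zipar_passes(height: int, width: int, window: int) -> int:
    """Count forward passes for a ZipAR-style wavefront schedule on a
    height x width token grid.

    Plain row-major AR decoding needs height * width passes (one token per
    pass). Under a next-set schedule, row i+1 may begin decoding once row i
    is `window` tokens ahead, so several rows decode concurrently and each
    pass emits one token per active row.
    """
    # Pass index at which token (row, col) is decoded: row 0 proceeds
    # serially; each subsequent row is offset by `window` passes.
    finish = [[row * window + col for col in range(width)]
              for row in range(height)]
    # Total passes = index of the last token's pass, plus one.
    return finish[height - 1][width - 1] + 1
```

For example, on a hypothetical 32x32 token grid with `window=2`, the pass count drops from 1024 to 94, a roughly 91% reduction, which is in the ballpark of the figure reported for Emu3-Gen (the paper's actual grid size and locality setting are not assumed here). Setting `window=width` degenerates to fully serial decoding, recovering the original `height * width` passes.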