ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Autoregressive (AR) visual generation is slow at inference time because it decodes sequentially, token by token, and therefore needs many forward passes. Method: This paper introduces ZipAR, a training-free, plug-and-play parallel decoding framework that exploits spatial locality among image tokens. ZipAR proposes a "next-set prediction" paradigm: it keeps the original next-token prediction along the row dimension and, in the same forward pass, also decodes tokens from spatially adjacent regions in the column dimension, turning serial decoding into multi-token parallel generation. ZipAR changes only how decoding is scheduled at inference time and requires no fine-tuning or retraining. Results: On the Emu3-Gen model, ZipAR reduces the number of forward passes by up to 91% without any additional retraining.

📝 Abstract
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling the "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining. Code is available here: https://github.com/ThisisBillhe/ZipAR.
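To make the forward-pass saving concrete, the following is a minimal sketch of a row-parallel decoding schedule in the spirit of the abstract. It is not the authors' implementation: the function name `count_forward_passes` and the `window` parameter are hypothetical, and the readiness rule (a row may decode its next token once the row above is `window` tokens ahead, or has finished) is an assumption chosen for illustration. The sketch only counts passes on a token grid and does not call any model.

```python
def count_forward_passes(height: int, width: int, window: int) -> int:
    """Count forward passes for a simplified row-parallel schedule.

    Assumed rule (illustrative, not taken from the paper's code):
    the next token of row r may be decoded once row r-1 has decoded
    at least `window` more tokens than row r, or row r-1 is complete.
    Every ready row decodes exactly one token per forward pass.
    """
    if height <= 0 or width <= 0 or window <= 0:
        raise ValueError("height, width and window must be positive")

    progress = [0] * height  # tokens decoded so far in each row
    passes = 0
    while progress[-1] < width:
        # Decide readiness from the state at the start of this pass,
        # so tokens decoded in the same pass do not unlock each other.
        ready = []
        for r in range(height):
            if progress[r] >= width:
                continue  # row already finished
            if r == 0 or progress[r - 1] >= min(progress[r] + window, width):
                ready.append(r)
        for r in ready:
            progress[r] += 1
        passes += 1
    return passes


if __name__ == "__main__":
    # 4x4 token grid: plain next-token decoding needs 16 passes.
    print(count_forward_passes(4, 4, window=4))  # window == width: fully serial
    print(count_forward_passes(4, 4, window=2))  # rows overlap
```

Under this assumed rule, a 4x4 grid takes 16 passes when `window` equals the row width (ordinary serial decoding) and 10 passes when `window` is 2. A smaller window lets more rows overlap and lowers the pass count, at the cost of each row seeing less of the row above it before it starts.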
Problem

Research questions and friction points this paper is trying to address.

Accelerate auto-regressive image generation with parallel decoding
Reduce model forward passes by exploiting spatial locality
Improve generation efficiency without requiring retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel decoding for faster image generation
Leverages spatial locality in visual tokens
Reduces forward passes without retraining
Authors

Yefei He
Zhejiang University
Computer Vision, Autoregressive Visual Generation, Model Quantization

Feng Chen
The University of Adelaide, Australia

Yuanyu He
Zhejiang University, China

Shaoxuan He
Zhejiang University, China

Hong Zhou
Zhejiang University, China

Kaipeng Zhang
Shanghai AI Laboratory
LLM, Multimodal LLMs, AIGC

Bohan Zhuang
Zhejiang University
Efficient AI, MLSys