🤖 AI Summary
Autoregressive (AR) visual generation suffers from low inference efficiency because it relies on sequential, token-by-token decoding, which requires one forward pass per token. Method: This paper introduces ZipAR, a training-free, plug-and-play parallel decoding framework that exploits spatial locality among image tokens. ZipAR proposes a novel "next-set prediction" paradigm: while preserving row-major order for foundational token prediction, it concurrently predicts multiple spatially adjacent tokens along the column dimension, extending conventional serial decoding into multi-token parallel generation. Crucially, ZipAR is purely an inference-time scheduling optimization: it requires no model architecture modifications, no additional parameters, and no fine-tuning or retraining. Results: Evaluated on Emu3-Gen, ZipAR reduces the forward pass count by up to 91%, substantially improving throughput while largely preserving generation quality, enabling efficient AR image synthesis with zero training overhead.
📝 Abstract
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling the "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining. Code is available here: https://github.com/ThisisBillhe/ZipAR.
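To make the speedup arithmetic concrete, the following is a minimal sketch of a wavefront-style schedule consistent with the abstract's description. It is not the authors' implementation: the `window` parameter (how many tokens ahead row `i` must be before row `i+1` starts decoding, standing in for the spatial-locality assumption) and the function name are illustrative assumptions, and the sketch only counts forward passes rather than running a model.

```python
def zipar_passes(height: int, width: int, window: int) -> int:
    """Count forward passes for a ZipAR-style wavefront schedule on a
    height x width token grid.

    Plain row-major AR decoding needs height * width passes (one token per
    pass). Under a next-set schedule, row i+1 may begin decoding once row i
    is `window` tokens ahead, so several rows decode concurrently and each
    pass emits one token per active row.
    """
    # Pass index at which token (row, col) is decoded: row 0 proceeds
    # serially; each subsequent row is offset by `window` passes.
    finish = [[row * window + col for col in range(width)]
              for row in range(height)]
    # Total passes = index of the last token's pass, plus one.
    return finish[height - 1][width - 1] + 1
```

For example, on a hypothetical 32x32 token grid with `window=2`, the pass count drops from 1024 to 94, a roughly 91% reduction, which is in the ballpark of the figure reported for Emu3-Gen (the paper's actual grid size and locality setting are not assumed here). Setting `window=width` degenerates to fully serial decoding, recovering the original `height * width` passes.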