🤖 AI Summary
Traditional autoregressive image generation models suffer from inefficient inference and poor zero-shot generalization because of their fixed raster-scan token ordering. To address this, we propose ARPG, a novel autoregressive model that generates tokens in fully random order without predefined sequencing constraints. Its core innovation is a position-guided decoupling mechanism that separates positional queries from content-based keys and values, enabling random-order modeling and synthesis under strictly causal attention and eliminating the need for bidirectional attention. Methodologically, ARPG combines decoupled query-key-value encoding, guided causal attention, and shared key-value caching for parallelized inference. On ImageNet-1K 256×256, ARPG achieves an FID of 1.94 with only 64 sampling steps, while improving throughput by 20× and reducing GPU memory consumption by over 75%, significantly advancing both efficiency and generalization.
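The decoupling described above can be sketched in a few lines. The following is an illustrative NumPy toy, not the paper's implementation; the function and weight names (`guided_causal_attention`, `Wq`, `Wk`, `Wv`) are our own. Target-position embeddings form the queries, already-known content tokens supply the keys and values, and a causal mask keeps each prediction step from seeing later content:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guided_causal_attention(pos_queries, content, Wq, Wk, Wv):
    """Position-guided decoupled attention (illustrative sketch).

    pos_queries: (T, d) embeddings of the *target* positions to predict
                 ("where to predict next" -- the guidance).
    content:     (T, d) embeddings of already-known tokens
                 ("what is known" -- keys and values).
    Step i's query may only attend to content tokens 0..i, so the
    generation order can be arbitrary while attention stays causal.
    """
    T = content.shape[0]
    Q = pos_queries @ Wq              # queries built from position only
    K = content @ Wk                  # keys/values built from content only
    V = content @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # block future content
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ V
```

Because the query carries only positional guidance, shuffling which position each step targets requires no change to the (causal) attention pattern over the content stream.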
📝 Abstract
We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, whose sequential, predefined token generation order hinders inference efficiency and zero-shot generalization. Our key insight is that effective random-order modeling requires explicit guidance for determining the position of the next predicted token. To this end, we propose a guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By incorporating this guidance directly into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256×256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models of similar scale.
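The shared-KV-cache parallel inference mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's code: the dict-based cache, the function name `parallel_decode_step`, and the weight matrix `Wq` are hypothetical. The point it shows is that several positional queries can read the same cached keys and values in a single pass, so one forward step yields several tokens at once:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def parallel_decode_step(pos_queries, kv_cache, Wq):
    """Decode several target positions in one step from a shared KV cache.

    pos_queries: (M, d) embeddings of the M positions to predict this step.
    kv_cache:    dict with 'K' and 'V' arrays of shape (N, d) holding the
                 keys/values of the N tokens generated so far.
    All M queries attend to the same cache, so a single pass produces M
    hidden states; their keys/values would then be appended to the cache
    before the next step.
    """
    Q = pos_queries @ Wq
    K, V = kv_cache["K"], kv_cache["V"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return attn @ V   # (M, d): one hidden state per predicted position
```

Varying M per step trades sampling quality against speed; with M queries per pass, a T-token image needs roughly T/M forward steps instead of T.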