Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large codebooks in autoregressive image generation create a vocabulary-explosion problem: modeling complexity grows with codebook size, forcing a trade-off between reconstruction quality and inference efficiency. Method: This paper proposes a Coarse-to-Fine (CTF) hierarchical token prediction paradigm. Motivated by the observation that semantically similar tokens in large codebooks are highly redundant, it introduces a two-stage decoupled modeling strategy: (1) clustering codewords to construct a compact coarse label set, modeled globally via autoregression for structural coherence; and (2) parallel refinement of fine-grained tokens conditioned on the coarse labels. The approach combines VQ-VAE-based discrete representation learning with a two-level conditional autoregressive architecture. Results: On ImageNet, the method achieves an average improvement of 59 points in Inception Score while sampling faster than baseline methods, demonstrating simultaneous gains in fidelity and efficiency.

📝 Abstract
Autoregressive models have shown remarkable success in image generation by adapting sequential prediction techniques from language modeling. However, applying these approaches to images requires discretizing continuous pixel data through vector quantization methods like VQ-VAE. To alleviate the quantization errors that exist in VQ-VAE, recent works tend to use larger codebooks. However, this accordingly expands the vocabulary size, complicating the autoregressive modeling task. This paper aims to find a way to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Through empirical investigation, we discover that tokens with similar codeword representations produce similar effects on the final generated image, revealing significant redundancy in large codebooks. Based on this insight, we propose to predict tokens from coarse to fine (CTF), realized by assigning the same coarse label to similar tokens. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels. Experiments on ImageNet demonstrate our method's superior performance, achieving an average improvement of 59 points in Inception Score compared to baselines. Notably, despite adding an inference step, our approach achieves faster sampling speeds.
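The core mechanism above (cluster similar codewords into a compact coarse vocabulary, then pick a fine token within the chosen cluster) can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook is random toy data, plain k-means stands in for whatever clustering the authors use, and the `refine` nearest-neighbor lookup stands in for the learned parallel refinement model.

```python
import numpy as np

def kmeans(codebook, k, iters=20, seed=0):
    """Plain k-means over codeword embeddings (assumed clustering method)."""
    rng = np.random.default_rng(seed)
    centers = codebook[rng.choice(len(codebook), size=k, replace=False)]
    for _ in range(iters):
        # Assign every codeword to its nearest cluster center.
        dists = ((codebook[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Recompute centers; skip clusters that lost all members.
        for c in range(k):
            members = codebook[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return labels, centers

# Toy "large codebook": 256 codewords in an 8-D embedding space.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(256, 8))

# Build the compact coarse vocabulary: 256 fine tokens -> 16 coarse labels.
# Similar codewords share a coarse label, which is what the AR model predicts.
coarse_of_fine, centers = kmeans(codebook, k=16)

def refine(coarse_label, target_emb):
    """Stand-in for the refinement model: given a coarse label (stage 1
    output), choose the fine token within that cluster whose embedding is
    closest to a target. The real model does this in parallel for all
    positions, conditioned on the coarse labels."""
    members = np.where(coarse_of_fine == coarse_label)[0]
    dists = ((codebook[members] - target_emb) ** 2).sum(-1)
    return int(members[dists.argmin()])

# A fine token recovered through its own coarse label maps back to itself.
fine = refine(coarse_of_fine[0], codebook[0])
print(fine)  # → 0
```

The point of the sketch is the factorization: the autoregressive model only ever sees the 16-symbol coarse vocabulary instead of all 256 fine tokens, while the within-cluster choice is deferred to a cheap, parallel second stage.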
Problem

Research questions and friction points this paper is trying to address.

Reduces quantization errors in autoregressive image generation.
Manages large codebooks without complicating autoregressive modeling.
Improves image quality and sampling speed through coarse-to-fine token prediction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine token prediction exploits redundancy among similar codewords
Two-stage model: autoregressive coarse-label prediction, then parallel fine-grained refinement
Higher generation quality (Inception Score) with faster sampling