Autoregressive Visual Generation Needs a Prologue

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the misalignment between reconstruction and generation objectives in autoregressive (AR) image modeling by introducing Prologue, a method that prepends a small set of learnable prologue tokens to the visual token sequence. The prologue tokens are trained solely with the AR cross-entropy loss while the visual tokens remain dedicated to reconstruction, so generation can be optimized without compromising reconstruction fidelity. The authors also give a theoretical interpretation from the ELBO perspective. On ImageNet 256×256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance, and Prologue-Large achieves an rFID of 0.99 and a gFID of 1.46. A linear probe on only 16 prologue tokens attains 35.88% Top-1 accuracy, compared with 23.71% for the first 16 tokens of a standard tokenizer, which points to emergent semantic structure in the prologue representation.
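
For reference, the generic evidence lower bound (ELBO) for a latent-variable model is written out below. The symbols are assumptions for illustration and are not taken from the paper: x is the image (or its visual tokens), z is a latent code, q_φ is an encoder, and p_θ is the generative model. One possible reading, which this page does not confirm, is that the prologue tokens play the role of z.

```latex
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)
```

The first term rewards explaining x given z, and the second penalizes codes that the prior over z cannot model well. How the paper maps its own losses onto these terms is not stated here.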

📝 Abstract

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
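
The sequence layout described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the shared vocabulary of 8 ids, the token values, and the uniform stand-in for the AR model are all hypothetical. It only shows prologue tokens being prepended to the visual tokens and the next-token cross-entropy being computed over the joint sequence.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def build_sequence(prologue_tokens, visual_tokens):
    """Prepend prologue tokens so the AR model predicts them first."""
    return list(prologue_tokens) + list(visual_tokens)

def per_token_nll(sequence, logits_per_step):
    """logits_per_step[t] holds the scores for position t given sequence[:t]."""
    if len(sequence) != len(logits_per_step):
        raise ValueError("need one logit vector per position")
    return [-log_softmax(logits)[tok]
            for tok, logits in zip(sequence, logits_per_step)]

# Toy data (hypothetical): 2 prologue tokens followed by 4 visual tokens.
VOCAB_SIZE = 8
prologue = [1, 5]
visual = [0, 3, 7, 2]
sequence = build_sequence(prologue, visual)

# Stand-in for an AR model: uniform scores at every position.
logits = [[0.0] * VOCAB_SIZE for _ in sequence]

nll = per_token_nll(sequence, logits)
ar_loss = sum(nll) / len(nll)                             # CE over the joint sequence
prologue_loss = sum(nll[:len(prologue)]) / len(prologue)  # prologue positions only
visual_loss = sum(nll[len(prologue):]) / len(visual)      # visual positions only
```

The sketch does not model gradients. In the design the abstract describes, the AR cross-entropy is the only signal that shapes the prologue tokens, while the visual tokens stay tied to the reconstruction objective.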

Problem

Research questions and friction points this paper is trying to address.

autoregressive generation

reconstruction-generation gap

visual representation

image generation

generative modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregressive generation

prologue tokens

reconstruction-generation gap