AI Summary
Existing autoregressive image generation models rely heavily on classifier-free guidance (CFG) and semantic tokenizers, limiting their capacity for holistic semantic modeling of images. This work proposes Heptapod, a novel autoregressive image generator grounded in language modeling principles. Methodologically, it introduces a two-dimensional causal attention mechanism and 2D next-distribution prediction, unifying sequential modeling with masked autoencoding objectives while eliminating dependence on CFG and semantic tokenizers. It employs a causal Transformer architecture coupled with a reconstruction-oriented visual tokenizer, jointly optimizing autoregressive generation and self-supervised learning. Evaluated on ImageNet, Heptapod achieves a state-of-the-art FID score of 2.70, substantially outperforming prior causal autoregressive approaches. The framework establishes a more semantically coherent and disentangled generative paradigm for image synthesis.
Abstract
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs \textbf{causal attention}, \textbf{eliminates reliance on CFG}, and \textbf{eschews the trend of semantic tokenizers}. Our key innovation is \textit{next 2D distribution prediction}: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics through generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of $2.70$, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
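To make the objective concrete, here is a minimal toy sketch of what "next 2D distribution prediction" could look like as a loss: at each causal timestep the model emits a distribution over the token at every cell of the H x W grid, and cross-entropy is averaged over the cells not yet revealed, echoing a masked-autoencoding target. All names, shapes, and the raster scan order here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def next_2d_distribution_loss(logits, tokens):
    """Toy version of a 2D next-distribution objective (assumed form).

    logits: (T, H*W, V) -- at each timestep t, a prediction over the
            vocabulary for EVERY cell of the flattened H*W grid.
    tokens: (H*W,) ground-truth token ids in raster-scan order.
    Returns mean cross-entropy over the not-yet-generated cells.
    """
    T, N, V = logits.shape
    total, count = 0.0, 0
    for t in range(T):
        # numerically stable log-softmax over the vocab axis
        z = logits[t] - logits[t].max(axis=-1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        # only score cells that are still unrevealed at step t
        for n in range(t, N):
            total -= logp[n, tokens[n]]
            count += 1
    return total / count

# tiny 2x2 grid with a 5-token vocabulary, random model outputs
rng = np.random.default_rng(0)
H = W = 2
V = 5
N = H * W
tokens = rng.integers(0, V, size=N)
logits = rng.normal(size=(N, N, V))  # one grid-wide prediction per timestep
loss = next_2d_distribution_loss(logits, tokens)
```

With random logits the loss sits near log V, and it shrinks as the per-cell distributions concentrate on the true tokens, which is the sense in which the sequential (causal) and holistic (grid-wide) signals combine in one objective.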