AI Summary
Existing autoregressive image generation models rely heavily on classifier-free guidance (CFG) and semantic tokenizers, limiting their capacity for holistic semantic modeling of images. This work proposes Heptapod, a novel autoregressive image generator grounded in language modeling principles. Methodologically, it introduces a two-dimensional causal attention mechanism and 2D next-distribution prediction, unifying sequential modeling with masked autoencoding objectives while eliminating dependence on CFG and semantic tokenizers. It employs a causal Transformer architecture coupled with a reconstruction-oriented visual tokenizer, jointly optimizing autoregressive generation and self-supervised learning. Evaluated on ImageNet, Heptapod achieves a state-of-the-art FID score of 2.70, substantially outperforming prior causal autoregressive approaches. The framework establishes a more semantically coherent and disentangled generative paradigm for image synthesis.
Abstract
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs \textbf{causal attention}, \textbf{eliminates reliance on CFG}, and \textbf{eschews the trend of semantic tokenizers}. Our key innovation is \textit{next 2D distribution prediction}: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics through generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of $2.70$, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
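To make the objective concrete, here is a minimal toy sketch of what "next 2D distribution prediction" could look like as a loss: at each causal timestep the model emits a distribution over the token at every cell of the H x W grid, and cross-entropy is averaged over the cells not yet revealed, echoing a masked-autoencoding target. All names, shapes, and the raster scan order here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def next_2d_distribution_loss(logits, tokens):
    """Toy version of a 2D next-distribution objective (assumed form).

    logits: (T, H*W, V) -- at each timestep t, a prediction over the
            vocabulary for EVERY cell of the flattened H*W grid.
    tokens: (H*W,) ground-truth token ids in raster-scan order.
    Returns mean cross-entropy over the not-yet-generated cells.
    """
    T, N, V = logits.shape
    total, count = 0.0, 0
    for t in range(T):
        # numerically stable log-softmax over the vocab axis
        z = logits[t] - logits[t].max(axis=-1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        # only score cells that are still unrevealed at step t
        for n in range(t, N):
            total -= logp[n, tokens[n]]
            count += 1
    return total / count

# tiny 2x2 grid with a 5-token vocabulary, random model outputs
rng = np.random.default_rng(0)
H = W = 2
V = 5
N = H * W
tokens = rng.integers(0, V, size=N)
logits = rng.normal(size=(N, N, V))  # one grid-wide prediction per timestep
loss = next_2d_distribution_loss(logits, tokens)
```

With random logits the loss sits near log V, and it shrinks as the per-cell distributions concentrate on the true tokens, which is the sense in which the sequential (causal) and holistic (grid-wide) signals combine in one objective.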