Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive image generation models, which adapt the NLP paradigm of “next-token prediction” to vision, face three fundamental bottlenecks: insufficient modeling of local and conditional dependencies, inter-step semantic inconsistency, and a lack of spatial invariance—all of which hinder high-level visual semantic learning. To address these, we propose a “comprehend-then-generate” self-supervised training paradigm, introducing Self-guided Training for AutoRegressive models (ST-AR)—a framework requiring no external pretrained models. Its core innovation is a self-supervised objective that explicitly guides the model to learn structured semantic representations *before* autoregressive decoding. Evaluated on the LlamaGen family, ST-AR achieves approximately 42% (LlamaGen-L) and 49% (LlamaGen-XL) FID improvements, significantly enhancing generation quality while preserving full compatibility with existing sampling strategies. This work is the first to systematically identify and resolve these intrinsic limitations of next-token prediction in visual generative modeling.

📝 Abstract
Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
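To make the training idea concrete, here is a minimal sketch of what a loss combining next-token prediction with an auxiliary self-supervised objective could look like. This is an illustrative assumption, not ST-AR's actual implementation: the abstract does not specify the exact self-supervised objectives, so the inter-step consistency term (cosine agreement between adjacent per-step features) and the weight `lam` are hypothetical stand-ins for the idea of addressing inter-step semantic inconsistency during autoregressive training.

```python
import numpy as np

def next_token_ce(logits, targets):
    """Standard autoregressive loss: cross-entropy of next-token predictions.

    logits: (T, V) scores over a vocabulary of V visual tokens at T positions.
    targets: (T,) ground-truth token ids.
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def step_consistency_loss(features):
    """Hypothetical self-supervised term: encourage adjacent decoding steps
    to produce semantically consistent hidden states.

    features: (T, D) per-step hidden states.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = (f[:-1] * f[1:]).sum(axis=1)  # cosine similarity of adjacent steps
    return (1.0 - cos).mean()

def combined_loss(logits, targets, features, lam=0.5):
    """Illustrative total loss: generation loss plus a weighted
    self-supervised representation term (lam is an assumed hyperparameter)."""
    return next_token_ce(logits, targets) + lam * step_consistency_loss(features)
```

In this sketch, uniform logits yield a cross-entropy of `log(V)`, and identical per-step features drive the consistency term to zero; a real training setup would backpropagate both terms jointly through the transformer.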
Problem

Research questions and friction points this paper is trying to address.

Overcoming autoregressive models' limitations in learning high-level visual semantics
Improving image understanding without pre-trained representation models
Enhancing generation quality via self-supervised training objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-guided training with self-supervised objectives
Addressing local dependence, inter-step semantic inconsistency, and spatial invariance deficiency
Enhancing autoregressive models without pre-trained representations