🤖 AI Summary
Autoregressive (AR) image generation suffers from structural reconstruction difficulties, inference incoherence, and misalignment with human visual cognition. Method: This work introduces, for the first time, Chain-of-Thought (CoT) reasoning into AR image generation—without modifying model architecture or raster-scan order. Instead, we design image-specific, semantics-aware reasoning prompts that guide the model to first capture global distributional priors before performing pixel-wise generation, thereby achieving cognitive alignment. Our technical approach integrates AR modeling, multi-step CoT prompt engineering, and hierarchical reasoning enhancement. Contribution/Results: Experiments demonstrate a ~20% reduction in Fréchet Inception Distance (FID) over baseline AR methods, alongside substantial improvements in detail fidelity, structural consistency, and cross-regional logical coherence. The proposed framework establishes a new paradigm for AR image generation that is interpretable, stable, and grounded in human visual perception.
📝 Abstract
In the field of autoregressive (AR) image generation, models based on the 'next-token prediction' paradigm of LLMs have shown comparable performance to diffusion models by reducing inductive biases. However, directly applying LLMs to complex image generation can struggle with reconstructing the structure and details of the image, impacting the accuracy and stability of generation. Additionally, the 'next-token prediction' paradigm in the AR model does not align with the contextual scanning and logical reasoning processes involved in human visual perception, limiting effective image generation. Chain-of-Thought (CoT), as a key reasoning capability of LLMs, utilizes reasoning prompts to guide the model, improving reasoning performance on complex natural language process (NLP) tasks, enhancing accuracy and stability of generation, and helping the model maintain contextual coherence and logical consistency, similar to human reasoning. Inspired by CoT from the field of NLP, we propose autoregressive Image Generation with Thoughtful Reasoning (IGTR) to enhance autoregressive image generation. IGTR adds reasoning prompts without modifying the model structure or raster generation order. Specifically, we design specialized image-related reasoning prompts for AR image generation to simulate the human reasoning process, which enhances contextual reasoning by allowing the model to first perceive overall distribution information before generating the image, and improve generation stability by increasing the inference steps. Compared to the AR method without prompts, our method shows outstanding performance and achieves an approximate improvement of 20%.