🤖 AI Summary
Autoregressive (AR) image generation models based on progressive resolution scaling suffer from semantic inconsistency between patches at different timesteps, which misaligns conditional guidance signals and yields blurry outputs with semantic distortions. To address this, we propose Information-Grounding Guidance (IGG), a guidance mechanism that dynamically anchors conditional signals to semantically critical regions via attention-based modulation and adaptively reweights image patches during sampling, keeping guidance and content tightly aligned. IGG integrates seamlessly into the next-scale prediction AR framework without modifying the backbone architecture. Experiments demonstrate substantial improvements in image sharpness, structural coherence, and semantic fidelity on both class-conditional and text-to-image generation tasks, establishing new state-of-the-art performance for AR models on multiple benchmarks and offering a principled paradigm for controllable autoregressive image synthesis.
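The summary describes two moving parts: an attention-derived per-patch weight and a guidance update that uses it. Below is a minimal sketch of what such attention-modulated guidance could look like, assuming a classifier-free-guidance-style setup with paired conditional/unconditional logits; the function name `igg_guidance`, the tensor shapes, and the `lam` interpolation knob are illustrative assumptions, not the paper's actual interface.

```python
import torch

def igg_guidance(
    logits_cond: torch.Tensor,    # [B, N, V] token logits with conditioning
    logits_uncond: torch.Tensor,  # [B, N, V] token logits without conditioning
    attn_to_cond: torch.Tensor,   # [B, N] attention mass each patch puts on the condition
    cfg_scale: float = 4.0,
    lam: float = 1.0,             # 0 = plain uniform CFG, 1 = fully attention-grounded
) -> torch.Tensor:
    # Normalize attention mass into a per-patch weight in (0, 1].
    w = attn_to_cond / (attn_to_cond.amax(dim=-1, keepdim=True) + 1e-8)
    # Patches that attend strongly to the condition get a larger guidance
    # scale; the rest fall back toward ordinary classifier-free guidance.
    scale = cfg_scale * ((1.0 - lam) + lam * w)  # [B, N]
    return logits_uncond + scale.unsqueeze(-1) * (logits_cond - logits_uncond)

# Toy usage with random tensors, just to show the shapes.
B, N, V = 2, 256, 4096
guided = igg_guidance(torch.randn(B, N, V), torch.randn(B, N, V), torch.rand(B, N))
print(guided.shape)  # torch.Size([2, 256, 4096])
```

The interpolation via `lam` is one plausible way to hedge between global and per-patch guidance; the paper may use a different weighting scheme.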
📝 Abstract
Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps, introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from the conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditional and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.
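For intuition, here is how such a guidance step might slot into a next-scale prediction sampling loop (VAR-style), reusing the `igg_guidance` sketch above. Everything here is an assumption for illustration: `model`, its `return_attn` flag, and the scale schedule are placeholders for whatever interface the real framework exposes.

```python
import torch

@torch.no_grad()
def sample_next_scale(model, cond, scales=(1, 2, 4, 8, 16), cfg_scale=4.0):
    """Hypothetical next-scale AR sampling with attention-grounded guidance.
    Assumes model(prefix, cond=..., return_attn=True) returns per-token
    logits and each patch's attention mass on the condition."""
    prefix = []  # token maps generated at the coarser scales so far
    for side in scales:
        # Two forward passes share weights: with and without the condition.
        logits_c, attn = model(prefix, cond=cond, return_attn=True)
        logits_u, _ = model(prefix, cond=None, return_attn=True)
        # Reweight guidance per patch using the attention map (sketch above).
        logits = igg_guidance(logits_c, logits_u, attn, cfg_scale=cfg_scale)
        # Sample this scale's side*side token map and append it, so the next,
        # finer scale is predicted conditioned on everything coarser.
        tokens = torch.distributions.Categorical(logits=logits).sample()
        prefix.append(tokens.view(-1, side, side))
    return prefix
```

Note that the guidance is applied at every scale, which matches the abstract's claim that informative patches are reinforced throughout sampling rather than in a single post-hoc pass.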