🤖 AI Summary
Autoregressive (AR) image generation models based on progressive resolution scaling suffer from semantic inconsistency between patches at different timesteps, which misaligns conditional guidance signals and yields blurry outputs with semantic distortions. To address this, we propose Information-Grounding Guidance (IGG), a guidance mechanism that dynamically anchors conditional signals to semantically critical regions via attention-based modulation and adaptively reweights image patches during sampling, keeping guidance and content tightly aligned. IGG integrates seamlessly into the next-scale prediction AR framework without modifying the backbone architecture. Experiments demonstrate substantial improvements in image sharpness, structural coherence, and semantic fidelity on both class-conditional and text-to-image generation tasks, establishing new state-of-the-art performance for AR models on multiple benchmarks and offering a principled paradigm for controllable autoregressive image synthesis.
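The summary describes two moving parts: an attention-derived per-patch weight and a guidance update that uses it. Below is a minimal sketch of what such attention-modulated guidance could look like, assuming a classifier-free-guidance-style setup with paired conditional/unconditional logits; the function name `igg_guidance`, the tensor shapes, and the `lam` interpolation knob are illustrative assumptions, not the paper's actual interface.

```python
import torch

def igg_guidance(
    logits_cond: torch.Tensor,    # [B, N, V] token logits with conditioning
    logits_uncond: torch.Tensor,  # [B, N, V] token logits without conditioning
    attn_to_cond: torch.Tensor,   # [B, N] attention mass each patch puts on the condition
    cfg_scale: float = 4.0,
    lam: float = 1.0,             # 0 = plain uniform CFG, 1 = fully attention-grounded
) -> torch.Tensor:
    # Normalize attention mass into a per-patch weight in (0, 1].
    w = attn_to_cond / (attn_to_cond.amax(dim=-1, keepdim=True) + 1e-8)
    # Patches that attend strongly to the condition get a larger guidance
    # scale; the rest fall back toward ordinary classifier-free guidance.
    scale = cfg_scale * ((1.0 - lam) + lam * w)  # [B, N]
    return logits_uncond + scale.unsqueeze(-1) * (logits_cond - logits_uncond)

# Toy usage with random tensors, just to show the shapes.
B, N, V = 2, 256, 4096
guided = igg_guidance(torch.randn(B, N, V), torch.randn(B, N, V), torch.rand(B, N))
print(guided.shape)  # torch.Size([2, 256, 4096])
```

The interpolation via `lam` is one plausible way to hedge between global and per-patch guidance; the paper may use a different weighting scheme.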
📝 Abstract
Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps, introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from the conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditional and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.
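For intuition, here is how such a guidance step might slot into a next-scale prediction sampling loop (VAR-style), reusing the `igg_guidance` sketch above. Everything here is an assumption for illustration: `model`, its `return_attn` flag, and the scale schedule are placeholders for whatever interface the real framework exposes.

```python
import torch

@torch.no_grad()
def sample_next_scale(model, cond, scales=(1, 2, 4, 8, 16), cfg_scale=4.0):
    """Hypothetical next-scale AR sampling with attention-grounded guidance.
    Assumes model(prefix, cond=..., return_attn=True) returns per-token
    logits and each patch's attention mass on the condition."""
    prefix = []  # token maps generated at the coarser scales so far
    for side in scales:
        # Two forward passes share weights: with and without the condition.
        logits_c, attn = model(prefix, cond=cond, return_attn=True)
        logits_u, _ = model(prefix, cond=None, return_attn=True)
        # Reweight guidance per patch using the attention map (sketch above).
        logits = igg_guidance(logits_c, logits_u, attn, cfg_scale=cfg_scale)
        # Sample this scale's side*side token map and append it, so the next,
        # finer scale is predicted conditioned on everything coarser.
        tokens = torch.distributions.Categorical(logits=logits).sample()
        prefix.append(tokens.view(-1, side, side))
    return prefix
```

Note that the guidance is applied at every scale, which matches the abstract's claim that informative patches are reinforced throughout sampling rather than in a single post-hoc pass.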