SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

📅 2026-02-05
🤖 AI Summary
This work addresses the inconsistency between training and inference in multi-scale visual autoregressive (VAR) generation, where limited model capacity and error accumulation often cause deviations from the intended coarse-to-fine hierarchical structure at inference time. To mitigate this issue without additional training, the authors propose Scaled Spatial Guidance (SSG), an inference-time guidance method grounded in information-theoretic insights into the origins of generation bias. SSG leverages Discrete Spatial Enhancement (DSE) in the frequency domain to precisely extract the high-frequency semantic residual unique to each scale, which is then used to construct coarse-grained priors. These priors spatially guide the autoregressive generation of discrete visual tokens during inference. Experiments demonstrate that SSG consistently enhances both the fidelity and diversity of generated images across multiple VAR models while maintaining low latency, thereby unlocking the efficiency potential of coarse-to-fine generative paradigms.
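The core idea of the summary above, isolating the high-frequency content of a scale that is not explained by the coarser scales, can be sketched with a simple frequency-domain high-pass residual. This is a hypothetical illustration of the DSE concept, not the paper's implementation; the function name, the nearest-neighbor upsampling, and the radial cutoff are all assumptions for the sake of a runnable example.

```python
import numpy as np

def semantic_residual(fine, coarse, cutoff=0.25):
    """Isolate high-frequency content of `fine` not explained by `coarse`.

    Hypothetical sketch of the DSE idea: upsample the coarse scale,
    subtract it from the fine scale, and keep only frequencies above
    `cutoff` (a fraction of the Nyquist frequency) via an FFT high-pass mask.
    """
    h, w = fine.shape
    # Nearest-neighbor upsample of the coarse map to the fine resolution.
    up = np.kron(coarse, np.ones((h // coarse.shape[0], w // coarse.shape[1])))
    residual = fine - up
    # Radial high-pass mask in the frequency domain (DC and lows removed).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mask = np.sqrt(fy**2 + fx**2) >= cutoff * 0.5  # 0.5 = Nyquist
    return np.real(np.fft.ifft2(np.fft.fft2(residual) * mask))
```

Because the mask removes the DC component, the returned residual is zero-mean: only detail absent from the coarser prior survives, matching the intuition that each scale should contribute genuinely new high-frequency information.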

📝 Abstract
Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
Problem

Research questions and friction points this paper is trying to address.

visual autoregressive generation
coarse-to-fine hierarchy
train-inference discrepancy
multi-scale generation
high-frequency content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaled Spatial Guidance
Visual Autoregressive Generation
Semantic Residual
Discrete Spatial Enhancement
Multi-Scale Generation