🤖 AI Summary
In autoregressive (AR) image generation, Classifier-Free Guidance (CFG) suffers from two fundamental issues: guidance decay, where the divergence between conditional and unconditional outputs diminishes over decoding steps, and over-guidance, where strong conditioning degrades visual coherence. This paper proposes SoftCFG, an uncertainty-aware dynamic guidance strategy: it estimates sequence-level confidence to adaptively perturb token generation, and introduces Step Normalization to suppress error accumulation and ensure stability for long sequences. The method operates entirely at inference time, requires no fine-tuning, and is compatible with existing AR architectures, unifying certainty-weighted guidance and step normalization into a single lightweight framework. Evaluated on ImageNet 256, the approach achieves the best FID (10.32) among AR models, significantly outperforming standard CFG and state-of-the-art baselines. To the authors' knowledge, this is the first work to systematically mitigate CFG's intrinsic limitations within the AR paradigm.
📝 Abstract
Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.
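To make the idea concrete, here is a minimal sketch of how certainty-weighted guidance with a step-normalization term might look at decoding time. The abstract does not give the paper's exact formulas, so the confidence estimate (max softmax probability of the conditional prediction), the `w * confidence` weighting, and the `1/sqrt(step + 1)` normalization schedule below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_cfg_logits(cond_logits, uncond_logits, step, w=3.0):
    """Illustrative SoftCFG-style step: scale the usual CFG offset
    (cond - uncond) by the model's confidence in its conditional
    prediction, then damp it by a step-dependent factor so that
    cumulative perturbations stay bounded over long sequences.

    All specifics here (confidence measure, damping schedule) are
    hypothetical stand-ins for the paper's actual formulation.
    """
    # Confidence in (0, 1]: how peaked the conditional distribution is.
    confidence = softmax(cond_logits).max(axis=-1, keepdims=True)
    guidance = w * confidence * (cond_logits - uncond_logits)
    # "Step normalization": shrink the perturbation as decoding proceeds.
    guidance /= np.sqrt(step + 1)
    return uncond_logits + guidance
```

In plain CFG the offset `w * (cond - uncond)` is applied with a fixed strength at every step; the sketch instead lets low-confidence tokens receive weaker guidance and bounds the total injected perturbation, which is the stabilizing behavior the abstract attributes to SoftCFG and Step Normalization.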