🤖 AI Summary
To address exposure bias in scale-wise autoregressive generative models, which arises from a train-test mismatch in conditional distributions and from imbalanced optimization difficulty across scales, we propose Self-Autoregressive Refinement (SAR). The method introduces two key components: (1) a Stagger-Scale Rollout mechanism that exposes the model to its own intermediate predictions during training, and (2) a Contrastive Student-Forcing Loss, which leverages pretrained model outputs as references to supervise self-generated contexts. The approach is lightweight and plug-and-play: 10 fine-tuning epochs on ImageNet-256 yield a 5.2% FID reduction with minimal computational overhead, and the method generalizes across architectures and datasets while significantly alleviating the train-test distribution mismatch in multi-scale modeling.
📝 Abstract
Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Informed by a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet-256 within 10 epochs (5 hours on 32 A100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.
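The contrast between teacher forcing and a stagger-scale rollout can be illustrated with a minimal toy sketch. All names here (`toy_model`, `stagger_scale_rollout`) are illustrative assumptions, not the paper's code: a "scale" is a flat token list, and the stand-in model simply upsamples the last scale.

```python
# Illustrative sketch only, not the paper's implementation. A "scale" is
# a flat token list; `toy_model` is a hypothetical stand-in for the AR
# model's next-scale prediction (here: naive 2x token upsampling).

def toy_model(context_scales):
    """Predict the next (finer) scale from the coarser context scales."""
    last = context_scales[-1]
    return [tok for tok in last for _ in range(2)]  # double the length

def teacher_forcing_contexts(gt_scales):
    """Standard training: every context is built from ground-truth scales."""
    return [gt_scales[:k] for k in range(1, len(gt_scales))]

def stagger_scale_rollout(gt_scales, rollout_from):
    """Student forcing from scale `rollout_from` onward: coarser scales
    come from ground truth, finer ones from the model's own predictions,
    so training contexts match what the model sees at inference time."""
    context = list(gt_scales[:rollout_from])
    for _ in range(rollout_from, len(gt_scales)):
        context.append(toy_model(context))
    return context
```

Under teacher forcing, every context is ground truth; under the rollout, later scales are self-generated and so drift from the ground-truth scales, which is exactly the inference-time input distribution the loss must then supervise.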