Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive (AR) image generation models suffer from slow inference due to iterative sampling, which precludes real-time use; existing few-step approaches (e.g., DD1) incur substantial performance degradation under single-step sampling and rely on a predefined mapping, limiting flexibility. To address this, the paper proposes Distilled Decoding 2 (DD2), which eliminates the need for a handcrafted mapping by using the teacher AR model's conditional scores in latent space as supervision: a separate network predicts the per-token score of the generated distribution, and a conditional score distillation (CSD) loss, applied at every token position conditioned on previous tokens, trains a one-step generator for high-fidelity single-pass generation. On ImageNet-256, DD2 achieves an FID of 5.43, only +2.03 over the original AR baseline (FID 3.40), reducing the one-step performance gap relative to DD1 by 67%, with up to 12.3× training speed-up.

📝 Abstract
Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel *conditional score distillation loss* to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between the one-step sampling and original AR model by 67%, with up to 12.3× training speed-up simultaneously. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at https://github.com/imagination-research/Distilled-Decoding-2.
Problem

Research questions and friction points this paper is trying to address.

Enabling one-step sampling for slow autoregressive image models
Eliminating reliance on predefined mappings in distilled decoding
Minimizing performance degradation in one-step autoregressive generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses conditional score distillation for one-step generation
Trains separate network to predict conditional scores
Eliminates pre-defined mapping dependency in distillation
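The core idea above can be illustrated with a minimal sketch. This is not the paper's implementation: all function names, signatures, and the toy score functions are hypothetical, and it only shows the generic score-distillation update direction (teacher score minus fake score, per token, conditioned on the prefix).

```python
import numpy as np

def csd_gradient(teacher_score, fake_score):
    # Score distillation steers generated latents along the difference
    # between the teacher's conditional score and the score of the
    # generated ("fake") distribution at each token position.
    return teacher_score - fake_score

def csd_update(latents, teacher_score_fn, fake_score_fn, prefix, lr=0.1):
    """One illustrative update of generated per-token latents.

    teacher_score_fn / fake_score_fn map (latents, prefix) -> per-token
    scores with the same shape as `latents` (assumed signatures; in the
    paper the fake score comes from a separately trained network).
    """
    grad = csd_gradient(teacher_score_fn(latents, prefix),
                        fake_score_fn(latents, prefix))
    return latents + lr * grad
```

With toy Gaussian scores (teacher centered at 1, fake at 0), each update shifts the latents toward the teacher's mode, which is the qualitative behavior the distillation loss relies on.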