SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the long-standing trade-off between training efficiency and inference speed in sequence generation, aiming to unify the training scalability of autoregressive (AR) models with the parallel decoding capability of diffusion models. To this end, we propose SDAR (Synergistic Diffusion-AutoRegression): a paradigm that converts a pretrained AR model into a block-wise discrete diffusion model through a lightweight paradigm conversion, enabling autoregressive inter-block generation and parallel intra-block decoding. Crucially, SDAR avoids full retraining, requiring only brief, data-efficient adaptation fine-tuning. Scaling studies across dense and Mixture-of-Experts (MoE) architectures show that larger models are more robust to block size and decoding thresholds, yielding greater speedups without accuracy loss. Experiments demonstrate that a 30B MoE SDAR model matches or exceeds its AR counterpart on scientific reasoning benchmarks, including GPQA and ChemBench, while achieving substantial inference acceleration, with further gains under test-time scaling methods such as majority voting and pass@k.
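The block-wise decoding loop described above can be pictured in code. The following is a minimal sketch, not the authors' implementation: `denoiser`, `MASK_ID`, `BLOCK_SIZE`, and the confidence `THRESHOLD` are illustrative assumptions standing in for the paper's block-wise discrete diffusion process.

```python
# Hypothetical sketch of SDAR-style decoding: autoregressive across blocks,
# iterative parallel denoising within each block. All names are illustrative.
import torch

MASK_ID = 0          # placeholder id for the [MASK] token (assumed)
BLOCK_SIZE = 16      # tokens decoded in parallel per block (assumed)
THRESHOLD = 0.9      # confidence needed to commit a token early (assumed)


def decode_block(denoiser, prefix: torch.Tensor) -> torch.Tensor:
    """Fill one block of MASK tokens via iterative parallel denoising."""
    block = torch.full((BLOCK_SIZE,), MASK_ID, dtype=torch.long)
    while (block == MASK_ID).any():
        # Predict distributions for the block positions given the prefix.
        logits = denoiser(torch.cat([prefix, block]))[-BLOCK_SIZE:]
        conf, pred = logits.softmax(-1).max(-1)
        masked = block == MASK_ID
        # Commit every masked position whose confidence clears the threshold;
        # always commit at least the most confident one to guarantee progress.
        commit = masked & (conf >= THRESHOLD)
        if not commit.any():
            best = torch.where(masked, conf, torch.tensor(-1.0)).argmax()
            commit[best] = True
        block[commit] = pred[commit]
    return block


def generate(denoiser, prompt: torch.Tensor, num_blocks: int) -> torch.Tensor:
    """Autoregressive over blocks, parallel within each block."""
    seq = prompt
    for _ in range(num_blocks):
        seq = torch.cat([seq, decode_block(denoiser, seq)])
    return seq
```

Because each block is committed before the next begins, left-to-right coherence is preserved while intra-block tokens resolve in parallel; the threshold trades decoding steps against accuracy.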

📝 Abstract
We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.
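The test-time scaling methods named in the abstract are standard sampling-based procedures. Here is a short sketch of both, assuming `answers` holds final answers extracted from independently sampled completions; pass@k uses the standard unbiased estimator of Chen et al. (2021).

```python
# Sketch of the test-time scaling strategies mentioned above (assumptions noted).
from collections import Counter
from math import comb


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer among sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws from
    n samples, c of which are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```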
Problem

Research questions and friction points this paper is trying to address.

AR models train efficiently but decode token by token, limiting throughput
Diffusion models decode in parallel but are costly to train end to end
Converting a pretrained AR model to parallel decoding without losing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight AR-to-diffusion conversion via brief, data-efficient adaptation
Autoregressive inter-block generation with parallel intra-block diffusion decoding
Scales across dense and MoE models, gaining speedups without accuracy loss
👥 Authors
Shuang Cheng · Shanghai AI Laboratory
Yihan Bian · University of Maryland, College Park
Dawei Liu · Shanghai AI Laboratory
Yuhua Jiang · Tsinghua University
Yihao Liu · Shanghai AI Laboratory
Linfeng Zhang · DP Technology; AI for Science Institute
Wenhai Wang · Shanghai AI Laboratory
Qipeng Guo · Fudan University
Kai Chen · Shanghai AI Laboratory
Biqing Qi · Shanghai AI Laboratory
Bowen Zhou · Shanghai AI Laboratory