🤖 AI Summary
This work addresses the longstanding trade-off between training efficiency and inference speed in sequence generation, aiming to unify the training scalability of autoregressive (AR) models with the parallel decoding capability of diffusion models. To this end, we propose SDAR (Synergistic Diffusion-AutoRegression): a paradigm that converts a pretrained AR model into a blockwise discrete diffusion model, enabling autoregressive inter-block generation and parallel intra-block decoding. Crucially, SDAR avoids full retraining, requiring only brief, data-efficient adaptation. Scaling studies across dense and Mixture-of-Experts (MoE) architectures show that larger models are more robust to block size and decoding thresholds, and test-time scaling strategies (e.g., majority voting, pass@k) yield further gains. Experiments demonstrate that a 30B-parameter MoE SDAR model matches or exceeds the performance of its AR counterpart on scientific reasoning benchmarks, including GPQA and ChemBench, while achieving substantial inference acceleration.
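To make the inter-block/intra-block structure concrete, the sketch below builds the kind of attention mask such a conversion implies: tokens attend bidirectionally within their own block and causally to all earlier blocks. This is a minimal illustration of the general block-causal masking idea, not the paper's exact implementation; the function name and interface are assumptions.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Block-causal attention mask (illustrative, not the paper's exact scheme).

    Position i may attend to position j iff j's block does not come after
    i's block: bidirectional within a block, causal across blocks.
    Returns a (seq_len, seq_len) boolean tensor; True = attention allowed.
    """
    # Block index of each position, e.g. block_size=4 -> [0,0,0,0,1,1,1,1,...]
    block_id = torch.arange(seq_len) // block_size
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

# Example: 8 tokens, blocks of 4 -> two 4x4 all-True diagonal blocks,
# plus a True lower-left quadrant (block 1 attends back to block 0).
mask = block_causal_mask(seq_len=8, block_size=4)
```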
📝 Abstract
We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.
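The inference procedure the abstract describes (autoregressive generation across blocks, with tokens inside each block unmasked in parallel by an iterative discrete diffusion process) can be sketched as follows. Everything here is an assumption for illustration: the `model` interface, the `mask_id` token, and the rule of accepting all predictions whose confidence clears `threshold` (committing at least one token per step) stand in for the paper's actual sampler, which may differ in detail.

```python
import torch

@torch.no_grad()
def blockwise_diffusion_decode(model, prompt_ids, num_blocks, block_size,
                               mask_id, threshold=0.9, max_steps=None):
    """Sketch of SDAR-style decoding: AR across blocks, parallel within a block.

    Assumes `model(ids)` returns logits of shape (1, seq_len, vocab) under a
    block-causal mask, and `mask_id` marks not-yet-decoded positions.
    """
    ids = prompt_ids.clone()                        # (1, prompt_len), long
    max_steps = max_steps or block_size
    for _ in range(num_blocks):
        # Append a fully masked block, then denoise it iteratively.
        block = torch.full((1, block_size), mask_id,
                           dtype=ids.dtype, device=ids.device)
        ids = torch.cat([ids, block], dim=1)
        for _ in range(max_steps):
            masked = ids[0, -block_size:] == mask_id
            if not masked.any():
                break                               # block fully decoded
            logits = model(ids)[0, -block_size:]    # (block_size, vocab)
            conf, pred = logits.softmax(-1).max(-1)
            # Accept all sufficiently confident predictions in parallel ...
            accept = masked & (conf >= threshold)
            if not accept.any():
                # ... but always commit at least the most confident token,
                # so decoding makes progress even under a strict threshold.
                best = torch.where(masked, conf, torch.tensor(-1.0)).argmax()
                accept[best] = True
            ids[0, -block_size:][accept] = pred[accept]
    return ids
```

The `threshold` knob trades speed for accuracy: a lower threshold commits more tokens per denoising step (fewer model calls per block), which matches the abstract's observation that larger models tolerate more aggressive decoding thresholds without accuracy loss.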