🤖 AI Summary
This work addresses the longstanding trade-off between training efficiency and inference speed in sequence generation, aiming to unify the training scalability of autoregressive (AR) models with the parallel decoding capability of diffusion models. To this end, we propose SDAR (Synergistic Diffusion-AutoRegression): a paradigm that converts a pretrained AR model into a blockwise discrete diffusion model, enabling autoregressive inter-block generation and parallel intra-block decoding. Crucially, SDAR avoids full retraining, requiring only brief, data-efficient adaptation. Scaling studies across dense and Mixture-of-Experts (MoE) architectures show that larger models are more robust to block size and decoding thresholds, and test-time scaling strategies (e.g., majority voting, pass@k) yield further gains. Experiments demonstrate that a 30B-parameter MoE SDAR model matches or exceeds the performance of its AR counterpart on scientific reasoning benchmarks, including GPQA and ChemBench, while achieving substantial inference acceleration.
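To make the inter-block/intra-block structure concrete, the sketch below builds the kind of attention mask such a conversion implies: tokens attend bidirectionally within their own block and causally to all earlier blocks. This is a minimal illustration of the general block-causal masking idea, not the paper's exact implementation; the function name and interface are assumptions.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Block-causal attention mask (illustrative, not the paper's exact scheme).

    Position i may attend to position j iff j's block does not come after
    i's block: bidirectional within a block, causal across blocks.
    Returns a (seq_len, seq_len) boolean tensor; True = attention allowed.
    """
    # Block index of each position, e.g. block_size=4 -> [0,0,0,0,1,1,1,1,...]
    block_id = torch.arange(seq_len) // block_size
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

# Example: 8 tokens, blocks of 4 -> two 4x4 all-True diagonal blocks,
# plus a True lower-left quadrant (block 1 attends back to block 0).
mask = block_causal_mask(seq_len=8, block_size=4)
```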
📝 Abstract
We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.
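The inference procedure the abstract describes (autoregressive generation across blocks, with tokens inside each block unmasked in parallel by an iterative discrete diffusion process) can be sketched as follows. Everything here is an assumption for illustration: the `model` interface, the `mask_id` token, and the rule of accepting all predictions whose confidence clears `threshold` (committing at least one token per step) stand in for the paper's actual sampler, which may differ in detail.

```python
import torch

@torch.no_grad()
def blockwise_diffusion_decode(model, prompt_ids, num_blocks, block_size,
                               mask_id, threshold=0.9, max_steps=None):
    """Sketch of SDAR-style decoding: AR across blocks, parallel within a block.

    Assumes `model(ids)` returns logits of shape (1, seq_len, vocab) under a
    block-causal mask, and `mask_id` marks not-yet-decoded positions.
    """
    ids = prompt_ids.clone()                        # (1, prompt_len), long
    max_steps = max_steps or block_size
    for _ in range(num_blocks):
        # Append a fully masked block, then denoise it iteratively.
        block = torch.full((1, block_size), mask_id,
                           dtype=ids.dtype, device=ids.device)
        ids = torch.cat([ids, block], dim=1)
        for _ in range(max_steps):
            masked = ids[0, -block_size:] == mask_id
            if not masked.any():
                break                               # block fully decoded
            logits = model(ids)[0, -block_size:]    # (block_size, vocab)
            conf, pred = logits.softmax(-1).max(-1)
            # Accept all sufficiently confident predictions in parallel ...
            accept = masked & (conf >= threshold)
            if not accept.any():
                # ... but always commit at least the most confident token,
                # so decoding makes progress even under a strict threshold.
                best = torch.where(masked, conf, torch.tensor(-1.0)).argmax()
                accept[best] = True
            ids[0, -block_size:][accept] = pred[accept]
    return ids
```

The `threshold` knob trades speed for accuracy: a lower threshold commits more tokens per denoising step (fewer model calls per block), which matches the abstract's observation that larger models tolerate more aggressive decoding thresholds without accuracy loss.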