Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Efficiently transferring pretrained autoregressive language models (AR-LMs) to diffusion language models (dLMs) remains challenging due to architectural mismatches and a training-inference gap in mask token distributions. Method: This paper proposes a weight-preserving conversion framework featuring intra-block bidirectional and inter-block causal attention, which retains KV-cache compatibility, coupled with position-aware dynamic masking that assigns higher masking probabilities to later tokens to bridge the mask-distribution gap between training and inference; conversion is performed via continuous pretraining, and generation uses non-autoregressive parallel decoding. Contribution/Results: The resulting Efficient-DLM 8B achieves +5.4% and +2.7% accuracy over Dream 7B and Qwen3 4B, respectively, while delivering 4.5× and 2.7× higher throughput.

📝 Abstract
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
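The block-wise attention pattern described in the abstract, causal across blocks but bidirectional within each block, can be sketched as a boolean attention mask. This is a minimal NumPy illustration, not the paper's implementation; `block_size` is an illustrative hyperparameter whose value the paper does not specify here.

```python
import numpy as np

def blockwise_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask (True = attention allowed) that is bidirectional
    within each block and strictly causal between blocks."""
    # Block index of each token position.
    blocks = np.arange(seq_len) // block_size
    # Query i may attend to key j iff block(j) <= block(i):
    # full visibility inside the same block, causal across blocks.
    return blocks[:, None] >= blocks[None, :]

# Example: 6 tokens in blocks of 2 -> tokens 0-1, 2-3, 4-5 see each
# other fully; each block additionally sees all earlier blocks.
m = blockwise_attention_mask(6, 2)
```

Because inter-block attention stays causal, keys and values of completed blocks never change, which is what preserves KV-cache compatibility during decoding.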
Problem

Research questions and friction points this paper is trying to address.

Converting autoregressive models to efficient diffusion language models
Improving training efficiency and accuracy of diffusion language models
Enhancing speed and performance of non-autoregressive language generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convert autoregressive to diffusion models for speed
Use block-wise attention to preserve pretrained weight distributions
Apply position-dependent masking to align training with test-time mask distributions
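The position-dependent masking idea above can be sketched as follows. This is a hypothetical illustration assuming a linear schedule from `p_min` to `p_max`; the paper's exact masking schedule is not reproduced here.

```python
import numpy as np

def position_dependent_mask(seq_len: int, p_min: float = 0.1,
                            p_max: float = 0.9, rng=None) -> np.ndarray:
    """Sample a boolean mask (True = token masked) where later tokens
    are masked with higher probability, mimicking the roughly
    left-to-right reveal order observed at test time."""
    if rng is None:
        rng = np.random.default_rng()
    # Masking probability rises linearly with position (an assumption;
    # any monotone schedule would express the same principle).
    probs = np.linspace(p_min, p_max, seq_len)
    return rng.random(seq_len) < probs
```

Under a uniform schedule, early and late tokens are masked equally often during training, whereas at inference the model mostly sees early tokens revealed and late tokens still masked; skewing the masking probabilities toward later positions narrows that gap.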