DiRL: An Efficient Post-Training Framework for Diffusion Language Models

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low post-training efficiency and training-inference objective misalignment of diffusion language models (dLLMs), which severely hinder their complex reasoning capabilities (e.g., mathematical reasoning), this paper proposes DiRL—a holistic, end-to-end efficient post-training framework. Methodologically, DiRL introduces (1) DiPO, the first unbiased groupwise relative policy optimization algorithm tailored for dLLMs, effectively mitigating policy bias; and (2) a tightly integrated training-inference co-design, combining FlexAttention-accelerated block-level training with LMDeploy-optimized inference deployment. The resulting model, DiRL-8B-Instruct, achieves state-of-the-art performance among dLLMs on multiple mathematical reasoning benchmarks, surpassing comparably sized Qwen2.5-series models.

📝 Abstract
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient post-training for diffusion language models.
Improves performance on complex reasoning tasks like mathematics.
Introduces an unbiased optimization method for diffusion models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexAttention-accelerated blockwise training for efficiency
LMDeploy-optimized inference to align training and inference
Unbiased Group Relative Policy Optimization tailored for dLLMs
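The paper does not detail DiPO's unbiasedness correction here, but the GRPO idea it builds on can be illustrated: each prompt is sampled several times, and every completion's reward is normalized against its own group's statistics, replacing a learned value baseline. A minimal sketch (the function name and reward values are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Core of GRPO-style advantage estimation.

    All rewards in the list come from completions sampled for the same
    prompt; each is scored relative to its peers via group mean/std
    normalization, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt, reward 1.0 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, and the group's advantages sum to zero; DiPO's contribution is making this estimator unbiased under the dLLM decoding process, which differs from the left-to-right sampling GRPO assumes.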
Ying Zhu
Fudan University, Shanghai Innovation Institute, OpenMoss Team
Jiaxin Wan
Shanghai Innovation Institute, OpenMoss Team
Xiaoran Liu
Fudan University
natural language processing
Siyang He
Fudan University, Shanghai Innovation Institute, OpenMoss Team
Qiqi Wang
Fudan University, Shanghai Innovation Institute, OpenMoss Team
Xu Guo
Fudan University, Shanghai Innovation Institute
Tianyi Liang
PhD, East China Normal University, Shanghai AI Lab, Shanghai Innovation Institute
Multimodal Learning, LLMs, Image Editing
Zengfeng Huang
Fudan University
Algorithms, Graphs, Streaming, Learning Theory
Ziwei He
Shanghai Jiao Tong University
Machine Learning
Xipeng Qiu
Fudan University, Shanghai Innovation Institute