MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (DLMs) lag behind autoregressive LMs in reasoning performance, primarily because masked tokens are generated independently during denoising, which hinders modeling of both intra-sequence and inter-sequence token dependencies. This work gives the first formal definition and joint modeling of these two types of token correlation, and proposes a multi-reward optimization framework that reinforces correlation learning throughout the denoising process. To stabilize training and improve inference efficiency, the framework combines test-time scaling, rejection sampling, group step scheduling, and importance sampling, which together reduce reward variance and accelerate sampling. Experiments across multiple reasoning benchmarks show substantial gains over prior DLMs along with significantly faster sampling, achieving a favorable trade-off between output quality and computational efficiency.

📝 Abstract
Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture token correlations. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider token correlations during the denoising process. More specifically, our MRO approach leverages test-time scaling, rejection sampling, and reinforcement learning to directly optimize token correlations with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.
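The test-time scaling with rejection sampling mentioned above can be illustrated with a generic best-of-N sketch. This is not the paper's exact MRO algorithm; `toy_denoiser` and the `reward` function are hypothetical stand-ins for a DLM denoising run and one of the paper's reward signals.

```python
import random

def toy_denoiser(prompt, rng):
    # Stand-in for one full DLM denoising run (hypothetical): a real DLM
    # would iteratively unmask tokens; here we just append a random token.
    return prompt + [rng.choice(["yes", "no", "maybe"])]

def reward(seq):
    # Hypothetical verifier-style reward: prefers answers ending in "yes".
    return 1.0 if seq[-1] == "yes" else 0.0

def best_of_n(prompt, n=8, seed=0):
    # Test-time scaling with rejection sampling: draw n candidate
    # denoising outputs and keep only the highest-reward one.
    rng = random.Random(seed)
    candidates = [toy_denoiser(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

best = best_of_n(["Q:"], n=8)
```

The same pattern extends to selecting high-reward samples for reinforcement-learning updates rather than only for inference.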
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning in diffusion language models
Addressing poor token correlation across denoising steps
Optimizing multiple rewards for enhanced reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Reward Optimization for token correlation
Test-time scaling with rejection sampling
Group step and importance sampling strategies
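As a generic illustration of how importance sampling can estimate an expected reward under one distribution using samples from another (not the paper's exact strategy; all names and the toy distributions are hypothetical):

```python
import random

def expected_reward_is(reward, target_p, proposal_p, proposal_sample, n=10000, seed=0):
    # Self-normalized importance sampling: estimate E_target[reward(x)]
    # from samples x ~ proposal, weighted by target_p(x) / proposal_p(x).
    rng = random.Random(seed)
    xs = [proposal_sample(rng) for _ in range(n)]
    ws = [target_p(x) / proposal_p(x) for x in xs]
    return sum(w * reward(x) for w, x in zip(ws, xs)) / sum(ws)

# Toy discrete example over tokens {0, 1}:
target = {0: 0.2, 1: 0.8}      # distribution we care about
proposal = {0: 0.5, 1: 0.5}    # distribution we can actually sample from
est = expected_reward_is(
    reward=lambda x: float(x),            # reward = token value
    target_p=lambda x: target[x],
    proposal_p=lambda x: proposal[x],
    proposal_sample=lambda rng: 1 if rng.random() < 0.5 else 0,
)
# True E_target[reward] = 0.8; the estimate should land close to it.
```

Reusing samples from an old policy with such weights is a standard way to cut sampling cost in RL-style training loops.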
👥 Authors

Chenglong Wang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yang Gan, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Hang Zhou, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Chi Hu, ByteDance
Yongyu Mu, Northeastern University (multilingualism, machine translation, efficient models)
Kai Song, TikTok Inc. (NLP & LLM)
Murun Yang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Bei Li, Meituan LLM Team (machine translation, deep learning, large language models)
Chunliang Zhang, NiuTrans Research, Shenyang, China
Tongran Liu, CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China
Jingbo Zhu, Northeastern University, China (machine translation, language parsing, natural language processing)
Zhengtao Yu, Kunming University of Science and Technology
Tong Xiao, NiuTrans Research, Shenyang, China