🤖 AI Summary
Diffusion language models (DLMs) show weaker reasoning performance than autoregressive LMs, primarily because tokens are generated independently during denoising, which hinders effective modeling of both intra-sequence and inter-sequence token dependencies. This work introduces the first formal definition and joint modeling of these two types of token correlation. The authors propose a multi-reward optimization framework that dynamically reinforces correlation learning throughout the denoising process. To stabilize training and improve inference efficiency, it integrates test-time expansion, rejection sampling, grouped step scheduling, and importance sampling, which collectively reduce reward variance and accelerate sampling. Experiments across multiple reasoning benchmarks demonstrate substantial gains over prior DLMs alongside significantly faster sampling, achieving a favorable trade-off between output quality and computational efficiency.
📝 Abstract
Recent advances in diffusion language models (DLMs) have made them a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture token correlations. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to account for token correlations during the denoising process. More specifically, our MRO approach leverages test-time scaling, rejection sampling, and reinforcement learning to directly optimize token correlations with multiple carefully designed rewards. Additionally, we introduce grouped step scheduling and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves performance on reasoning benchmarks but also achieves significant sampling speedups while maintaining high output quality.
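The test-time scaling plus rejection sampling recipe described above can be sketched in miniature. The snippet below is a toy illustration only: `denoise_sample`, the two correlation rewards, and the weighting scheme are all hypothetical stand-ins invented for this sketch, not the paper's actual implementation. It draws N candidate sequences from a stub denoiser, scores each with a weighted combination of an intra-sequence reward and an inter-sequence reward, and keeps the highest-scoring candidate.

```python
import random

random.seed(0)  # deterministic toy run


def denoise_sample(prompt, steps=8):
    """Toy stand-in for a DLM denoising pass: returns a random token sequence."""
    return [random.randint(0, 9) for _ in range(steps)]


def intra_seq_reward(seq):
    """Toy intra-sequence correlation reward: fraction of adjacent token
    pairs that are 'close' to each other (illustrative heuristic only)."""
    return sum(1.0 for a, b in zip(seq, seq[1:]) if abs(a - b) <= 2) / (len(seq) - 1)


def inter_seq_reward(seq, reference):
    """Toy inter-sequence correlation reward: positionwise overlap with a
    reference sample (illustrative heuristic only)."""
    return sum(1.0 for a, b in zip(seq, reference) if a == b) / len(seq)


def best_of_n(prompt, n=16, w_intra=0.5, w_inter=0.5):
    """Test-time scaling with rejection sampling: draw n candidates, score
    each with the weighted multi-reward, and keep only the best one."""
    reference = denoise_sample(prompt)
    scored = [
        (w_intra * intra_seq_reward(c) + w_inter * inter_seq_reward(c, reference), c)
        for c in (denoise_sample(prompt) for _ in range(n))
    ]
    return max(scored, key=lambda t: t[0])


reward, best = best_of_n("2+2=?", n=16)
```

In the paper's full method, the selected high-reward samples would feed a reinforcement-learning update of the denoiser rather than being returned directly; this sketch only shows the sampling-and-selection step.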