Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based large language models (dLLMs) suffer from erroneous path reinforcement and uncontrolled reasoning in complex reasoning tasks, primarily due to sparse, outcome-only reward signals. Method: This paper proposes a hierarchical, step-aware reinforcement learning framework that explicitly models problem solving as a layered decision process, grounded in a novel implicit reasoning hierarchy theory, and introduces a process-oriented, fine-grained reward function that guides and regulates each reasoning step interpretably. The method integrates diffusion mechanisms, hierarchical modeling, and structured RL training. Contribution/Results: Evaluated on multiple challenging reasoning benchmarks, the framework achieves significant improvements in both answer accuracy and the logical coherence of reasoning paths. It generalizes across diverse reasoning domains and provides transparent, step-level controllability, validating its effectiveness, robustness, and interpretability for complex reasoning with dLLMs.

📝 Abstract
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation, yet training them for complex reasoning remains a key challenge. Current reinforcement learning approaches often rely on sparse, outcome-based rewards, which can reinforce flawed reasoning paths that lead to coincidentally correct answers. We argue that this stems from a fundamental mismatch with the natural structure of reasoning. We first propose a theoretical framework that formalizes complex problem solving as a hierarchical selection process, where an intractable global constraint is decomposed into a series of simpler, localized logical steps. This framework provides a principled foundation for algorithm design, including theoretical insights into the identifiability of this latent reasoning structure. Motivated by this theory, we identify unstructured refinement -- a failure mode where a model's iterative steps do not contribute meaningfully to the solution -- as a core deficiency in existing methods. We then introduce Step-Aware Policy Optimization (SAPO), a novel RL algorithm that aligns the dLLM's denoising process with the latent reasoning hierarchy. By using a process-based reward function that encourages incremental progress, SAPO guides the model to learn structured, coherent reasoning paths. Our empirical results show that this principled approach significantly improves performance on challenging reasoning benchmarks and enhances the interpretability of the generation process.
Problem

Research questions and friction points this paper is trying to address.

Effectively training diffusion language models for complex reasoning tasks
Preventing sparse, outcome-based rewards from reinforcing flawed reasoning paths
Eliminating unstructured refinement in iterative reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical selection process for complex problem solving
Step-Aware Policy Optimization algorithm for reasoning
Process-based reward function for incremental progress
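The process-based reward idea above can be illustrated with a minimal sketch: instead of rewarding only the final outcome, each denoising step is credited for the *increment* of progress it contributes, so steps that refine nothing earn roughly zero. The function below is a hypothetical illustration, not the paper's actual implementation; `step_scores` stands in for whatever per-step quality estimate the reward model produces.

```python
def step_rewards(step_scores, outcome_reward, outcome_weight=1.0):
    """Turn per-step quality scores into dense, process-based rewards.

    Each step earns the increment in quality it contributes, so steps
    that do not advance the solution (unstructured refinement) receive
    ~0 reward; the sparse outcome reward is added at the final step.
    `step_scores` and `outcome_weight` are illustrative assumptions.
    """
    rewards = []
    prev = 0.0
    for score in step_scores:
        rewards.append(score - prev)  # credit only incremental progress
        prev = score
    # Fold the outcome-based signal into the final step's reward
    rewards[-1] += outcome_weight * outcome_reward
    return rewards
```

For example, a trajectory whose quality stalls on the second step yields zero reward for that step, making the lack of progress visible to the policy update rather than hidden behind a single terminal reward.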
Shaoan Xie
Carnegie Mellon University
Representation Learning · Generative Model · Causality
Lingjing Kong
Carnegie Mellon University
Machine Learning
Xiangchen Song
Carnegie Mellon University
Machine Learning · Causality · Data Mining
Xinshuai Dong
Carnegie Mellon University
ML
Guangyi Chen
Carnegie Mellon University, Pittsburgh, PA, USA; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Eric P. Xing
Carnegie Mellon University, Pittsburgh, PA, USA; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Kun Zhang
Carnegie Mellon University, Pittsburgh, PA, USA; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE