🤖 AI Summary
Discrete reinforcement learning in language model reasoning often suffers from mode collapse, leading to a loss of solution diversity. To address this issue, this work proposes LaDi-RL, a framework that decouples exploration from text generation by modeling semantic-level reasoning trajectories through guided diffusion in a continuous latent space. By leveraging multi-step denoising, LaDi-RL preserves multiple solution modes without mutual suppression. The approach integrates latent diffusion models with reinforcement learning to enable efficient and diverse policy optimization. Evaluated on code generation and mathematical reasoning tasks, LaDi-RL substantially outperforms discrete RL baselines, achieving absolute pass@1 improvements of 9.4% and 5.7%, respectively, while consistently enhancing pass@k metrics across settings.
📝 Abstract
Recent reinforcement learning (RL) methods improve LLM reasoning by optimizing discrete Chain-of-Thought (CoT) generation; however, exploration in token space often suffers from diversity collapse as policy entropy decreases, a consequence of the mode-eliciting behavior of discrete RL. To mitigate this issue, we propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), a framework that conducts exploration directly in a continuous latent space, where latent variables encode semantic-level reasoning trajectories. By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity across the sampling process and preserves multiple coexisting solution modes without mutual suppression. Furthermore, by decoupling latent-space exploration from text-space generation, we show that latent diffusion-based optimization is more effective than text-space policy optimization alone, and that a complementary text policy provides additional gains when combined with latent exploration. Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL baselines, with absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning, highlighting diffusion-based latent RL as a principled alternative to discrete token-level RL for reasoning.
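The core idea above, multi-step guided denoising that spreads stochasticity across steps so that several solution modes can coexist, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the value gradient, guidance scale, noise schedule, and latent dimension below are all hypothetical stand-ins for the learned components LaDi-RL would use.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # toy size; a real reasoning latent would be much larger
NUM_STEPS = 10   # number of denoising steps

def value_grad(z):
    # Hypothetical stand-in for the gradient of a learned value/reward model.
    # Here, "good reasoning" latents are assumed to lie near a fixed target,
    # so the gradient of -0.5 * ||z - target||^2 simply points toward it.
    target = np.ones(LATENT_DIM)
    return target - z

def guided_denoise(num_steps=NUM_STEPS, guidance_scale=0.3, noise_scale=0.5):
    """Sample a reasoning latent by multi-step guided denoising.

    Each step mixes a deterministic drift toward high-value regions with
    fresh noise whose magnitude shrinks over time, so stochasticity is
    distributed across steps rather than concentrated in one draw.
    """
    z = rng.standard_normal(LATENT_DIM)  # start from pure Gaussian noise
    for t in range(num_steps, 0, -1):
        # Guidance: drift toward regions the value model scores highly.
        z = z + guidance_scale * value_grad(z)
        # Annealed noise injection, loosely mimicking a diffusion schedule.
        z = z + noise_scale * (t / num_steps) * rng.standard_normal(LATENT_DIM)
    return z

z = guided_denoise()
```

Because each step injects independent noise, two runs can settle into different high-value latents instead of collapsing onto a single mode, which is the intuition the abstract attributes to diffusion-based exploration; in the full method, the resulting latent would then condition a separate text policy that decodes it into a Chain-of-Thought.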