DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

📅 2026-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning approaches for mathematical reasoning with large language models struggle to balance efficiency and stability: PPO-based methods are stable but suffer from slow convergence, while REINFORCE-based methods are efficient yet prone to instability. This work proposes DISPO, an algorithm that, for the first time, decouples the clipping of importance sampling weights based on whether model responses are correct or incorrect, introducing a four-parameter controllable clipping mechanism. Within a REINFORCE framework, this enables fine-grained policy updates that effectively balance exploration and distillation. The method substantially improves both training stability and learning efficiency, achieving 61.04% accuracy on AIME'24, outperforming CISPO (55.42%) and DAPO (50.21%), and demonstrates consistent gains across multiple models and benchmarks.

📝 Abstract
Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability, as they clip importance sampling weights while still permitting non-zero gradients outside the trust region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights > 1 increase the average token entropy (i.e., exploration) while weights < 1 decrease it (i.e., distillation); both are beneficial but cause gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights > 1) or vanishing response lengths (when weights < 1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% for CISPO and 50.21% for DAPO), with similar gains across various benchmarks and models.
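The core mechanism described above can be illustrated with a minimal sketch: a per-token importance-sampling ratio is clipped with separate lower/upper bounds depending on whether the response was verified correct, and the clipped weight then scales a REINFORCE-style log-probability objective. Note that the function and parameter names below (e.g., `eps_low_pos`, `eps_high_neg`) are illustrative assumptions, not the paper's actual notation or values, and this is not the authors' implementation.

```python
import numpy as np

def dispo_token_weights(ratio, is_correct,
                        eps_low_pos=0.2, eps_high_pos=0.2,
                        eps_low_neg=0.2, eps_high_neg=0.2):
    """Clip importance-sampling ratios with decoupled bounds.

    Four illustrative parameters (names assumed, not from the paper):
    separate down/up clipping for correct vs. incorrect responses.
    """
    lo = np.where(is_correct, 1.0 - eps_low_pos, 1.0 - eps_low_neg)
    hi = np.where(is_correct, 1.0 + eps_high_pos, 1.0 + eps_high_neg)
    return np.clip(ratio, lo, hi)

def dispo_loss(logp_new, logp_old, advantages, is_correct, **clip_kwargs):
    """REINFORCE-style surrogate: the clipped weight acts as a fixed
    (stop-gradient) multiplier on the advantage-weighted log-prob term."""
    ratio = np.exp(logp_new - logp_old)  # pi_new / pi_old per token
    w = dispo_token_weights(ratio, is_correct, **clip_kwargs)
    # In a real autodiff framework, w would be detached from the graph.
    return -(w * advantages * logp_new).mean()

# Example: different bounds apply to correct vs. incorrect tokens.
ratio = np.array([0.5, 1.5, 0.5, 1.5])
is_correct = np.array([True, True, False, False])
w = dispo_token_weights(ratio, is_correct,
                        eps_low_pos=0.3, eps_high_pos=0.1,
                        eps_low_neg=0.1, eps_high_neg=0.4)
# correct tokens clipped to [0.7, 1.1]; incorrect tokens to [0.9, 1.4]
```

In this sketch, the key difference from a single symmetric clip (as in PPO/CISPO) is simply that the bounds are looked up per token from the response's correctness label, which is what yields the four independently tunable regimes the abstract describes.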
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language models
mathematical reasoning
training stability
training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

DISPO
importance sampling decoupling
reinforcement learning for LLMs
mathematical reasoning
clipping control