🤖 AI Summary
Existing reinforcement learning approaches for mathematical reasoning with large language models struggle to balance efficiency and stability: PPO-based methods are stable but converge slowly, while REINFORCE-based methods are efficient yet prone to instability. This work proposes DISPO, an algorithm that, for the first time, decouples the clipping of importance sampling weights based on whether model responses are correct or incorrect, introducing a four-parameter controllable clipping mechanism. Within a REINFORCE framework, this enables fine-grained policy updates that effectively balance exploration and distillation. The method substantially improves both training stability and learning efficiency, achieving 61.04% accuracy on AIME'24, outperforming CISPO (55.42%) and DAPO (50.21%), and demonstrating consistent gains across multiple models and benchmarks.
📝 Abstract
Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability, as they clip importance sampling weights while still permitting non-zero gradients outside the trust region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights > 1 increase the average token entropy (i.e., exploration) while weights < 1 decrease it (i.e., distillation) -- both beneficial, but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights > 1) or vanishing response lengths (when weights < 1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% for CISPO and 50.21% for DAPO), with similar gains across various benchmarks and models.
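The core mechanism — correctness-dependent bounds on the importance sampling weight — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the parameter names, default values, and per-token scalar formulation are assumptions for exposition only.

```python
import math

def dispo_clipped_weight(logp_new, logp_old, is_correct,
                         eps_low_correct=0.2, eps_high_correct=0.2,
                         eps_low_incorrect=0.2, eps_high_incorrect=0.2):
    """Clip an importance-sampling weight exp(logp_new - logp_old) with
    separate up/down bounds for correct vs. incorrect responses.
    All four epsilon names and defaults are illustrative, not from the paper."""
    w = math.exp(logp_new - logp_old)
    if is_correct:
        lo, hi = 1.0 - eps_low_correct, 1.0 + eps_high_correct
    else:
        lo, hi = 1.0 - eps_low_incorrect, 1.0 + eps_high_incorrect
    # In a REINFORCE-style update, the clipped weight would scale the
    # log-probability gradient (with a stop-gradient on the weight itself).
    return min(max(w, lo), hi)
```

Decoupling the four bounds is what lets the up-clip for correct responses (exploration) be tuned independently of, say, the down-clip for incorrect responses, which the abstract identifies as a failure mode when over-restricted.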