DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

📅 2025-05-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses three key challenges in reinforcement learning (RL) for large reasoning models (LRMs): (1) policy bias induced by problem difficulty, (2) training instability due to entropy collapse, and (3) ineffective learning under severe positive–negative sample imbalance. To tackle these, we propose Discriminative Constrained Optimization (DisCO), the first framework to incorporate discriminative learning into LRM RL training—replacing group-wise relative objectives with a learnable scoring function. DisCO employs a non-clipped RL objective coupled with an explicit KL-divergence constraint, jointly mitigating difficulty-induced bias, alleviating entropy collapse, and inherently accommodating imbalanced data. Evaluated on six mathematical reasoning benchmarks, our 1.5B model achieves +7% average improvement over GRPO and +6% over DAPO, with markedly enhanced training stability and generalization.

📝 Abstract
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO (and its recent variants) are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint, ensuring stable training. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.
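To make the abstract's first two ideas concrete, here is a minimal sketch of a discriminative objective built on a non-clipped scoring function. The function name, the use of the raw importance ratio as the score, and the "mean positive score minus mean negative score" form are all illustrative assumptions for a binary-reward setting, not the paper's exact loss.

```python
import math

def disco_objective_sketch(logp_new, logp_old, rewards):
    """Illustrative discriminative objective (an assumption, not the paper's loss).

    Each generated answer i gets a score s_i = exp(logp_new_i - logp_old_i),
    i.e., a non-clipped importance-ratio surrogate. The objective then
    separates positive (reward 1) from negative (reward 0) answers by
    maximizing the mean positive score minus the mean negative score,
    independent of how many positives/negatives a question produced.
    """
    scores = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    pos = [s for s, r in zip(scores, rewards) if r == 1]
    neg = [s for s, r in zip(scores, rewards) if r == 0]
    if not pos or not neg:
        return 0.0  # degenerate group: all answers correct or all wrong
    return sum(pos) / len(pos) - sum(neg) / len(neg)
```

Because the score averages positives and negatives separately, the gradient scale does not depend on the pass rate of a question, which is one way a discriminative objective can avoid the question-level difficulty bias described above.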
Problem

Research questions and friction points this paper is trying to address.

Addressing difficulty bias in large reasoning models
Improving stability with non-clipping scoring functions
Enhancing performance using discriminative learning techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses discriminative objective with scoring function
Employs non-clipping RL surrogate objectives
Applies constrained optimization for KL divergence
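The last bullet, enforcing a KL-divergence constraint via constrained optimization, can be sketched by converting "maximize the objective subject to KL ≤ limit" into an unconstrained loss with a penalty on constraint violation. The squared-hinge penalty form and the hyperparameter values below are assumptions for illustration, not the paper's exact formulation.

```python
def kl_constrained_loss(objective, kl_value, kl_limit=0.05, penalty_coef=10.0):
    """Hedged sketch of a constrained-optimization step (assumed form).

    Rewrites  max objective  s.t.  KL(pi_new || pi_ref) <= kl_limit
    as a minimization with a squared-hinge penalty: the penalty is zero
    while the constraint holds and grows quadratically once KL exceeds
    the limit, discouraging the policy from drifting far from the
    reference model and helping keep training stable.
    """
    violation = max(0.0, kl_value - kl_limit)
    return -objective + penalty_coef * violation ** 2
```

While the KL estimate stays under the limit, the loss reduces to plain objective maximization; the penalty only activates when the policy update would violate the constraint.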