DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the core challenges of exploration-exploitation imbalance and high reward variance that impede convergence in reinforcement learning (RL) for long-chain reasoning with large language models (LLMs), this paper proposes Decoupled Group Reward Optimization (DGRO). Crucially, DGRO splits the traditional regularization coefficient into two independent hyperparameters, one scaling the policy-gradient term and one regulating the distance from the sampling policy, enabling finer-grained control over optimization dynamics; this decoupling also extends naturally to Online Policy Mirror Descent (OPMD, as used in Kimi k1.5) and Direct Reward Optimization. The method is supported by theoretical analysis, ablation studies, and evaluation across multiple mathematical reasoning benchmarks, and the theory quantifies how reward variance affects convergence speed and final reasoning performance. Empirically, DGRO achieves 96.9% average accuracy on the Logic dataset, surpassing the state of the art, and generalizes strongly to mathematical benchmarks such as GSM8K and MATH.
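The summary's point about reward variance can be made concrete with group-relative reward statistics, as used in group-based RL methods (e.g. GRPO-style normalization). The sketch below is illustrative only: the function name and the normalization form are assumptions, not the paper's exact analysis.

```python
import statistics

def group_reward_stats(rewards):
    """For one prompt's group of sampled completions, compute the reward mean,
    the (population) standard deviation, and mean-centered, std-normalized
    advantages. Illustrative sketch; DGRO's actual variance analysis differs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Small epsilon guards against a zero-variance group (all rewards equal),
    # the degenerate case where group-relative advantages carry no signal.
    advantages = [(r - mean) / (std + 1e-8) for r in rewards]
    return mean, std, advantages
```

A zero-variance group (all completions right, or all wrong) yields near-zero advantages and hence no gradient signal, which is one intuition for why reward variance matters for convergence.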

📝 Abstract
Inference scaling further accelerates Large Language Models (LLMs) toward Artificial General Intelligence (AGI), with large-scale Reinforcement Learning (RL) unleashing long Chain-of-Thought reasoning. Most contemporary reasoning approaches rely on handcrafted rule-based reward functions. However, the trade-off between exploration and exploitation in RL algorithms involves multiple complex considerations, and the theoretical and empirical impacts of manually designed reward functions remain insufficiently explored. In this paper, we propose Decoupled Group Reward Optimization (DGRO), a general RL algorithm for LLM reasoning. On the one hand, DGRO decouples the traditional regularization coefficient into two independent hyperparameters: one scales the policy gradient term, and the other regulates the distance from the sampling policy. This decoupling not only enables precise control over balancing exploration and exploitation, but also can be seamlessly extended to Online Policy Mirror Descent (OPMD) algorithms in Kimi k1.5 and Direct Reward Optimization. On the other hand, we observe that reward variance significantly affects both convergence speed and final model performance. We conduct both theoretical analysis and extensive empirical validation to assess DGRO, including a detailed ablation study that investigates its performance and optimization dynamics. Experimental results show that DGRO achieves state-of-the-art performance on the Logic dataset with an average accuracy of 96.9%, and demonstrates strong generalization across mathematical benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration-exploitation in RL for LLM reasoning
Managing reward variance to improve model convergence
Enhancing long Chain-of-Thought reasoning in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples policy gradient and sampling distance hyperparameters
Manages reward variance to improve convergence and performance
Extends to Online Policy Mirror Descent algorithms
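The decoupling idea above can be sketched as a per-token loss with two independent coefficients: `alpha` scaling the policy-gradient term and `beta` penalizing divergence from the sampling policy. All names, the specific penalty form, and default values here are assumptions for illustration; this is not the paper's exact DGRO objective.

```python
import math

def decoupled_loss(logp_new, logp_sample, advantage, alpha=1.0, beta=0.1):
    """Hedged sketch of a decoupled group-reward objective.

    alpha scales the policy-gradient (exploitation) term; beta independently
    regulates how far the updated policy may drift from the sampling policy
    (exploration control). In a single-coefficient scheme these two effects
    are tied together; decoupling lets each be tuned on its own.
    """
    ratio = math.exp(logp_new - logp_sample)
    pg_term = -alpha * advantage * logp_new
    # KL-like Bregman penalty: nonnegative, zero iff ratio == 1
    # (i.e., the new policy matches the sampling policy on this token).
    dist_term = beta * (ratio - 1.0 - math.log(ratio))
    return pg_term + dist_term
```

When `logp_new == logp_sample` the distance penalty vanishes, so `beta` only activates as the policy moves away from the sampler, while `alpha` independently controls the strength of the reward-driven update.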
Xuerui Su
Ph.D., Beijing Jiaotong University (BJTU)
Machine Learning · Reinforcement Learning
Liya Guo
Tsinghua University
Yue Wang
Zhongguancun Academy
Yi Zhu
Yau Mathematical Sciences Center, Tsinghua University, Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China
Zhiming Ma
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Zun Wang
Microsoft Research AI4Science, Beijing, China
Yuting Liu
School of Mathematics and Statistics, Beijing Jiaotong University