DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization

📅 2025-12-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Gradient conflicts, training instability, and low sample efficiency hinder GRPO's effectiveness in long-chain reasoning with large language models (LLMs). Method: We propose Distinctiveness-aware GRPO (DaGRPO), the first method to trace the origins of gradient conflict to sample distinctiveness. It introduces a fine-grained, score-driven dynamic masking mechanism that suppresses conflicting sequence-level gradients, and integrates anchor-guided off-policy data augmentation with high-quality samples to recover reasoning signals on hard examples. DaGRPO unifies sequence-level reward modeling, sample reweighting, and joint SFT-RL optimization. Results: Evaluated on nine mathematical reasoning and out-of-distribution generalization benchmarks, DaGRPO achieves an average accuracy gain of 4.7% over SFT, GRPO, and hybrid baselines. It significantly mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.

📝 Abstract
The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for eliciting such post-training reasoning capabilities due to its exceptional performance, it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as the lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts; whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness, thereby eradicating gradient conflicts at the source; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. Extensive experiments across 9 mathematical reasoning and out-of-distribution (OOD) generalization benchmarks demonstrate that DaGRPO significantly surpasses existing SFT, GRPO, and hybrid baselines, achieving new state-of-the-art performance (e.g., a +4.7% average accuracy gain on math benchmarks). Furthermore, in-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
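The abstract frames DaGRPO as a fix on top of GRPO's group-relative advantage, where each rollout's reward is normalized against the statistics of its sampling group. The following is a minimal illustrative sketch of that baseline computation (not code from the paper); the function name and the epsilon stabilizer are assumptions for illustration.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each rollout's reward
    against the mean and standard deviation of its own group.
    When all rewards in the group are identical (a homogeneous
    group, as the paper describes for routine queries), every
    advantage collapses toward zero."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group with rewards `[1, 1, 0, 0]` yields advantages of roughly `[+1, +1, -1, -1]`: positive samples are pushed up and negative ones down with equal force, which is where near-duplicate positive/negative pairs can produce the destructive gradient conflicts the paper targets.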
Problem

Research questions and friction points this paper is trying to address.

Addresses gradient conflicts in LLM reasoning training
Improves sample efficiency for hard query optimization
Enhances training stability in group relative policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic masking of low distinctiveness sample pairs
Off-policy data augmentation with high-quality anchors
Rectifies gradient conflicts to enhance training stability
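The masking idea in the first innovation bullet can be sketched as follows. The paper's exact scoring function and masking criterion are not given here; this sketch assumes a per-sequence distinctiveness score and a simple mean-gap threshold `tau`, both illustrative choices rather than the authors' method.

```python
def mask_low_distinctiveness(scores, advantages, tau=0.2):
    """Zero out the advantage of rollouts whose fine-grained score
    sits within tau of the group mean, i.e. samples too homogeneous
    to carry a distinctive training signal. Masking (rather than
    reweighting) removes the conflicting gradient at the source."""
    mean = sum(scores) / len(scores)
    return [0.0 if abs(s - mean) < tau else a
            for s, a in zip(scores, advantages)]
```

In a group where two rollouts score near the mean and one is clearly distinctive, only the distinctive rollout keeps a nonzero advantage, so the policy update is driven by samples that actually differ from their peers.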