$λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
GRPO suffers from length bias: uniform token-level advantage allocation dilutes gradient updates for longer responses. Existing improvements (e.g., DAPO, Dr. GRPO) rely on heuristic aggregation schemes and lack interpretable, principled token-level preference modeling. Method: The paper proposes λ-GRPO, which introduces a learnable scalar λ that adaptively modulates token-level weights during loss aggregation. This enables data-driven token preference learning within the original GRPO framework, without auxiliary reward models or hand-crafted rules, and it preserves GRPO's architecture with zero additional computational or data overhead. Contribution/Results: On multiple mathematical reasoning benchmarks, λ-GRPO improves the average accuracy of Qwen2.5 models (1.5B/3B/7B) by +1.9%, +1.0%, and +1.7% over vanilla GRPO, with consistent gains over DAPO as well. It also improves training robustness and offers better interpretability through the learned, adaptive token weighting.
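The summary's core idea (a single scalar λ controlling how a response's length affects its tokens' weights in the aggregated loss) can be sketched as below. This is a hypothetical illustration, not the paper's exact formulation: the weighting `|o_i|**λ`, the function name `lambda_weighted_loss`, and the normalization are assumptions chosen so that λ = -1 behaves like GRPO's per-response mean and λ = 0 like uniform token weighting (Dr. GRPO / DAPO style).

```python
def lambda_weighted_loss(token_advantages, lengths, lam):
    """Aggregate per-token advantages for a group of responses.

    Each response i gets a weight proportional to |o_i|**lam:
      lam = -1 -> roughly GRPO's per-response mean (length neutralized)
      lam =  0 -> uniform token weighting (longer responses contribute more)
    Hypothetical sketch; the exact lambda-GRPO weighting may differ.
    """
    w = [length ** lam for length in lengths]
    s = sum(w)
    w = [wi / s for wi in w]  # normalize weights across the group
    total = 0.0
    for adv, wi in zip(token_advantages, w):
        total += wi * sum(adv)  # wi applied to every token of response i
    return total / len(lengths)
```

With two responses of lengths 2 and 4 carrying the same per-token advantage, λ = -1 makes both responses contribute equally, while λ = 0 lets the longer response dominate; a learnable λ lets training pick the trade-off instead of hard-coding it.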

📝 Abstract
Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $λ$ that adaptively controls token-level weighting. We use $λ$-GRPO to denote our method, and we find that $λ$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $λ$-GRPO improves average accuracy by +1.9%, +1.0%, and +1.7% compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
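The length bias the abstract describes can be made concrete with a small arithmetic sketch. This is illustrative only, not the paper's loss: under a per-response mean, one scalar advantage is spread over all of a response's tokens, so each token of a longer response receives a proportionally smaller effective update.

```python
def per_token_weight(advantage, length):
    # Effective gradient weight each token receives when a single
    # response-level advantage is averaged over the response's tokens
    # (GRPO-style per-response mean; illustrative sketch).
    return advantage / length

short = per_token_weight(1.0, 10)    # each of 10 tokens gets 0.1
long_ = per_token_weight(1.0, 100)   # each of 100 tokens gets 0.01
```

For the same verified reward, every token of the 100-token response is updated with one tenth the weight of a token in the 10-token response, which is the dilution that length-aware aggregation schemes try to correct.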
Problem

Research questions and friction points this paper is trying to address.

Addressing length bias in GRPO by learning token preferences
Unifying heuristic GRPO variants under adaptive token weighting
Improving mathematical reasoning without extra data or computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces learnable parameter for token-level weighting
Unifies existing frameworks under single formulation
Learns token preferences adaptively during optimization