Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models

๐Ÿ“… 2025-10-29
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
In GRPO, excessively large gradients for low-probability tokens destabilize training and diminish the learning contribution of high-probability tokens. To address this, we propose Token-Regulated GRPO (TR-GRPO), which introduces a dynamic token-level weighting mechanism within the GRPO framework, with weights positively correlated with predicted token probabilities. This explicitly suppresses gradient amplification for low-probability tokens and strengthens the influence of high-confidence predictions on policy updates. TR-GRPO operates under reinforcement learning with verifiable rewards, requiring no auxiliary models or human annotations. Empirical evaluation across logical reasoning, mathematical problem solving, and agent-based tasks demonstrates that TR-GRPO consistently outperforms baseline methods, improving both training stability and generalization. These results underscore the critical role of token-level gradient balancing in enhancing the reasoning capabilities of large language models.

๐Ÿ“ Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model's predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks, including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
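The abstract's central claim is that low-probability tokens carry inherently large gradient magnitudes. This can be checked directly: the gradient of a token's log-probability with respect to the logits under a softmax is `one_hot(target) - softmax(logits)`, whose norm grows as the target token's probability shrinks. A small numerical sketch (not from the paper; `logprob_grad_norm` is an illustrative helper):

```python
import numpy as np

def logprob_grad_norm(logits, target):
    """Return (target probability, gradient norm) for d log p_target / d logits.

    For softmax outputs, this gradient is one_hot(target) - p, so its
    norm is small when p_target is near 1 and large when p_target is
    near 0 -- the gradient imbalance TR-GRPO aims to regulate.
    """
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    g = -p.copy()
    g[target] += 1.0  # one_hot(target) - p
    return p[target], np.linalg.norm(g)

logits = np.array([2.0, 0.0, 0.0])
p_hi, n_hi = logprob_grad_norm(logits, 0)  # high-probability token
p_lo, n_lo = logprob_grad_norm(logits, 1)  # low-probability token
```

Here the low-probability token yields a substantially larger gradient norm than the high-probability one, which is the over-amplification the abstract describes.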
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable training from low-probability token gradients
Mitigates gradient imbalance suppressing high-probability token contributions
Enhances reinforcement learning stability for large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level weights regulate gradient updates
Downweight low-probability tokens to stabilize training
Emphasize high-probability tokens for reliable learning
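The weighting idea in the bullets above can be sketched in a few lines. The summary does not specify the paper's exact weighting function, so this minimal sketch assumes the simplest choice consistent with it, a weight equal to the token's predicted probability (`w_t = p_t`); `tr_grpo_token_loss` and the `weight_fn` hook are illustrative names, not the paper's API:

```python
import numpy as np

def tr_grpo_token_loss(log_probs, advantage, weight_fn=None):
    """Per-token policy-gradient loss with probability-correlated weights.

    Each token's contribution `advantage * log_prob` is scaled by a
    weight that increases with its predicted probability, damping the
    large gradients of low-probability tokens and emphasizing
    high-confidence ones. Defaults to w_t = p_t (an assumption; the
    paper's exact weighting function is not given in this summary).
    """
    probs = np.exp(log_probs)
    weights = probs if weight_fn is None else weight_fn(probs)
    per_token = weights * advantage * log_probs  # regulated PG term
    return -per_token.mean()  # minimize negative weighted objective

# A high-probability token (p=0.9) now dominates the update relative
# to a low-probability one (p=0.1), reversing plain GRPO's imbalance.
log_probs = np.log(np.array([0.9, 0.1]))
loss = tr_grpo_token_loss(log_probs, advantage=1.0)
```

With unit weights this reduces to the standard per-token GRPO objective, so the weighting is a drop-in modification rather than a new loss family.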
๐Ÿ”Ž Similar Papers
No similar papers found.