🤖 AI Summary
In reinforcement learning (RL) training of large language models (LLMs), the gradients of low-probability tokens have excessively large magnitudes; they dominate model updates and suppress the learning signal of high-probability tokens, leading to gradient imbalance and degraded performance. To address this, the authors propose two synergistic techniques: Advantage Reweighting, which rescales each token's advantage according to its probability so that low-probability tokens contribute less to the update, and Low-Probability Token Isolation (Lopti), which uses a probability threshold to isolate low-probability tokens so that their gradients can be attenuated. Together, these enable fine-grained, token-level gradient balancing. Integrated into the GRPO framework, the methods achieve up to a 46.2% performance gain on the K&K logic puzzle benchmark, while improving reasoning stability and accelerating training convergence.
📝 Abstract
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
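The abstract only names the two methods, so the following is a minimal illustrative sketch of how token-level reweighting and isolation could look; the interpolation weight `alpha`, the `prob_threshold` value, and the function name are assumptions, not the paper's actual formulation (see the linked repository for the authors' implementation).

```python
import numpy as np

def reweight_and_isolate(token_probs, advantages, alpha=0.3, prob_threshold=0.5):
    """Hypothetical sketch of the two ideas from the abstract.

    Advantage Reweighting: scale each token's advantage toward its
    probability, so low-probability tokens contribute smaller updates.
    Lopti: split tokens into low-/high-probability groups by a threshold,
    so the two groups can be updated separately.
    """
    probs = np.asarray(token_probs, dtype=float)
    adv = np.asarray(advantages, dtype=float)

    # Reweighting: interpolate between a uniform weight (1.0) and a
    # probability-proportional weight; alpha controls the mix.
    weights = alpha * probs + (1.0 - alpha)
    reweighted_adv = weights * adv

    # Isolation: boolean masks marking low- vs high-probability tokens.
    low_mask = probs < prob_threshold
    high_mask = ~low_mask
    return reweighted_adv, low_mask, high_mask
```

In this sketch, a token with probability 0.9 retains most of its advantage while a token with probability 0.1 is damped, and the masks would let a GRPO-style trainer apply the two groups' gradients in separate steps.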