Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning (RL) training of large language models (LLMs), low-probability tokens produce gradients of excessively large magnitude that dominate model updates and suppress the learning signal of high-probability tokens, causing gradient imbalance and degraded performance. To address this, the paper proposes two synergistic techniques: Advantage Reweighting, which rescales token-level advantages according to token probability so that low-probability tokens contribute less to each update, and Low-Probability Token Isolation (Lopti), which separates tokens at a probability threshold and updates the low- and high-probability groups in distinct steps. Together, these enable fine-grained, token-level gradient balancing. Integrated into the GRPO framework, the methods achieve up to a 46.2% performance gain on the K&K Logic Puzzle benchmark, while improving reasoning stability and accelerating training convergence.
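A minimal sketch of the reweighting idea follows. It assumes a linear interpolation between "no reweighting" and "scale by token probability", controlled by a hypothetical hyperparameter `alpha`; the function name and default value are illustrative, not the authors' reference implementation (see the linked repository for that).

```python
import torch

def reweight_advantages(advantages: torch.Tensor,
                        token_probs: torch.Tensor,
                        alpha: float = 0.3) -> torch.Tensor:
    """Advantage Reweighting (sketch): scale token-level advantages by
    token probability so that low-probability tokens, whose raw
    gradients are the largest, contribute less to the update.

    advantages:  (batch, seq) GRPO advantages broadcast to each token
    token_probs: (batch, seq) current policy probability of each sampled token
    alpha:       interpolation strength (assumed); 0 recovers the unweighted loss
    """
    # Interpolate between 1 (no reweighting) and the token probability.
    # detach() keeps the weight itself out of the gradient computation.
    weights = alpha * token_probs.detach() + (1.0 - alpha)
    return weights * advantages
```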

📝 Abstract
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
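To make the second method concrete, here is a minimal sketch of how Lopti's threshold-based, two-stage update could look. All names are assumptions for illustration: `loss_fn(mask)` is a hypothetical helper returning the GRPO loss restricted to the masked tokens, and `eta` is the probability threshold; the authors' actual implementation is in the repository above.

```python
import torch

def lopti_step(loss_fn, token_probs: torch.Tensor,
               optimizer: torch.optim.Optimizer, eta: float = 0.5) -> None:
    """Low-Probability Token Isolation (sketch): partition tokens at a
    probability threshold and update each group in its own optimizer
    step, so low-probability gradients cannot drown out the rest."""
    low_mask = token_probs.detach() < eta   # tokens the policy finds unlikely
    high_mask = ~low_mask

    # Stage 1: update driven by low-probability tokens only.
    optimizer.zero_grad()
    loss_fn(low_mask).backward()
    optimizer.step()

    # Stage 2: update driven by high-probability tokens only,
    # now free of interference from the large low-probability gradients.
    optimizer.zero_grad()
    loss_fn(high_mask).backward()
    optimizer.step()
```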
Problem

Research questions and friction points this paper is trying to address.

Low-probability tokens dominate RL training updates
High-probability token gradients are suppressed in LLMs
Unbalanced token updates hinder RL training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage Reweighting balances token gradient updates
Low-Probability Token Isolation (Lopti) attenuates disruptive gradients
Enhances GRPO by prioritizing high-probability token learning
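The two techniques are complementary and can be composed inside a single GRPO iteration. A hypothetical composition, reusing the sketches above and an assumed `grpo_loss(advantages, mask)` helper:

```python
# Hypothetical composition inside one GRPO iteration (all helpers assumed):
adv = reweight_advantages(advantages, token_probs)                      # soften low-prob tokens
lopti_step(lambda mask: grpo_loss(adv, mask), token_probs, optimizer)   # isolate and update
```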
Zhihe Yang
The Chinese University of Hong Kong
Offline RL · RLHF · LLM · LMM
Xufang Luo
Microsoft Research Asia, Shanghai, China
Zilong Wang
Microsoft Research Asia, Shanghai, China
Dongqi Han
Microsoft Research Asia, Shanghai, China
Zhiyuan He
Microsoft Research Asia, Shanghai, China
Dongsheng Li
Microsoft Research Asia, Shanghai, China
Yunjian Xu
The Chinese University of Hong Kong
Deep reinforcement learning · Power systems · Electricity markets · Operations Research · Stochastic optimal control