Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reinforcement learning (RL) training of large language models (LLMs), low-probability tokens produce gradients of excessively large magnitude that dominate model updates and suppress the learning signal of high-probability tokens, causing gradient imbalance and degraded performance. To address this, the paper proposes two synergistic techniques: Advantage Reweighting, which rescales token-level advantages according to token probability so that low-probability tokens contribute less to each update, and Low-Probability Token Isolation (Lopti), which separates tokens at a probability threshold and updates the low- and high-probability groups in distinct steps. Together, these enable fine-grained, token-level gradient balancing. Integrated into the GRPO framework, the methods achieve up to a 46.2% performance gain on the K&K Logic Puzzle benchmark, while improving reasoning stability and accelerating training convergence.
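A minimal sketch of the reweighting idea follows. It assumes a linear interpolation between "no reweighting" and "scale by token probability", controlled by a hypothetical hyperparameter `alpha`; the function name and default value are illustrative, not the authors' reference implementation (see the linked repository for that).

```python
import torch

def reweight_advantages(advantages: torch.Tensor,
                        token_probs: torch.Tensor,
                        alpha: float = 0.3) -> torch.Tensor:
    """Advantage Reweighting (sketch): scale token-level advantages by
    token probability so that low-probability tokens, whose raw
    gradients are the largest, contribute less to the update.

    advantages:  (batch, seq) GRPO advantages broadcast to each token
    token_probs: (batch, seq) current policy probability of each sampled token
    alpha:       interpolation strength (assumed); 0 recovers the unweighted loss
    """
    # Interpolate between 1 (no reweighting) and the token probability.
    # detach() keeps the weight itself out of the gradient computation.
    weights = alpha * token_probs.detach() + (1.0 - alpha)
    return weights * advantages
```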

📝 Abstract
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
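To make the second method concrete, here is a minimal sketch of how Lopti's threshold-based, two-stage update could look. All names are assumptions for illustration: `loss_fn(mask)` is a hypothetical helper returning the GRPO loss restricted to the masked tokens, and `eta` is the probability threshold; the authors' actual implementation is in the repository above.

```python
import torch

def lopti_step(loss_fn, token_probs: torch.Tensor,
               optimizer: torch.optim.Optimizer, eta: float = 0.5) -> None:
    """Low-Probability Token Isolation (sketch): partition tokens at a
    probability threshold and update each group in its own optimizer
    step, so low-probability gradients cannot drown out the rest."""
    low_mask = token_probs.detach() < eta   # tokens the policy finds unlikely
    high_mask = ~low_mask

    # Stage 1: update driven by low-probability tokens only.
    optimizer.zero_grad()
    loss_fn(low_mask).backward()
    optimizer.step()

    # Stage 2: update driven by high-probability tokens only,
    # now free of interference from the large low-probability gradients.
    optimizer.zero_grad()
    loss_fn(high_mask).backward()
    optimizer.step()
```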
Problem

Research questions and friction points this paper is trying to address.

Low-probability tokens dominate RL training updates
High-probability token gradients are suppressed in LLMs
Unbalanced token updates hinder RL training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage Reweighting balances token gradient updates
Low-Probability Token Isolation (Lopti) attenuates disruptive gradients
Enhances GRPO by prioritizing high-probability token learning
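The two techniques are complementary and can be composed inside a single GRPO iteration. A hypothetical composition, reusing the sketches above and an assumed `grpo_loss(advantages, mask)` helper:

```python
# Hypothetical composition inside one GRPO iteration (all helpers assumed):
adv = reweight_advantages(advantages, token_probs)                      # soften low-prob tokens
lopti_step(lambda mask: grpo_loss(adv, mask), token_probs, optimizer)   # isolate and update
```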
Zhihe Yang
The Chinese University of Hong Kong
Offline RL · RLHF · LLM · LMM
Xufang Luo
Microsoft Research Asia, Shanghai, China
Zilong Wang
Microsoft Research Asia, Shanghai, China
Dongqi Han
Microsoft Research Asia, Shanghai, China
Zhiyuan He
Microsoft Research Asia, Shanghai, China
Dongsheng Li
Microsoft Research Asia, Shanghai, China
Yunjian Xu
The Chinese University of Hong Kong
Deep reinforcement learning · Power systems · Electricity markets · Operations Research · Stochastic optimal control