Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

📅 2026-05-12

🤖 AI Summary
This work addresses the suboptimal performance of Group Relative Policy Optimization (GRPO) caused by an imbalance between exploration and exploitation during training. To this end, we propose a covariance-weighted optimization method that requires no additional hyperparameters. Our approach introduces the covariance between token probabilities and their advantages into the policy gradient update and combines it with Gaussian kernel smoothing to reweight the advantages. This mechanism dynamically suppresses excessive updates from extreme tokens while preserving informative learning signals, thereby automatically modulating exploration intensity. Experimental results show that the proposed method improves on vanilla GRPO across multiple reasoning benchmarks, yielding better downstream task performance and stabilizing policy entropy throughout training.
📝 Abstract
Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to balance exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO and effectively stabilizes entropy as training progresses.
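The abstract does not give the exact weighting formula, so the following is only a minimal sketch of what Gaussian-kernel advantage reweighting could look like, not the authors' implementation. It assumes that each token's covariance contribution is the product of its centered log-probability and centered advantage, and that the kernel bandwidth is taken from the batch standard deviation of those contributions so that no tuning knob is needed. The function name `gaussian_kernel_reweight` and the small constant `eps` (for numerical safety only) are hypothetical.

```python
import math
from statistics import fmean, pstdev


def gaussian_kernel_reweight(logprobs, advantages, eps=1e-8):
    """Illustrative sketch: down-weight tokens with extreme covariance terms.

    Assumption (not taken from the paper): the per-token covariance term is
    (logprob - mean logprob) * (advantage - mean advantage).
    """
    mean_lp = fmean(logprobs)
    mean_adv = fmean(advantages)

    # Per-token contribution to the batch covariance of log-prob and advantage.
    cov_terms = [
        (lp - mean_lp) * (adv - mean_adv)
        for lp, adv in zip(logprobs, advantages)
    ]

    # Center and scale with batch statistics, so the kernel width adapts
    # to the data instead of being set by hand.
    center = fmean(cov_terms)
    scale = pstdev(cov_terms) + eps

    # Gaussian kernel: close to 1 for typical tokens, smaller for outliers.
    weights = [math.exp(-0.5 * ((c - center) / scale) ** 2) for c in cov_terms]

    reweighted = [w * adv for w, adv in zip(weights, advantages)]
    return reweighted, weights


# Toy batch: the last token has a very low log-prob and a large advantage.
logprobs = [-0.10, -0.20, -0.15, -5.00]
advantages = [0.1, -0.1, 0.0, 3.0]
reweighted, weights = gaussian_kernel_reweight(logprobs, advantages)
```

In this toy batch the last token receives the smallest weight, so its contribution to the update shrinks while the other tokens keep most of their advantage. In a GRPO-style loss the reweighted advantages would stand in for the raw ones, under the same assumptions.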
Problem

Research questions and friction points this paper is trying to address.

exploration-exploitation trade-off
policy optimization
large language models
training instability
entropy control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Covariance-Aware Optimization
Gaussian-Kernel Advantage Reweighting
Extreme Token Suppression
Entropy Stabilization
Group Relative Policy Optimization