STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical yet previously overlooked issue in reinforcement learning fine-tuning of large language models: training instability and inference degradation are frequently triggered by an extremely small fraction (approximately 0.01%) of rare, spurious tokens. The study demonstrates that these tokens destabilize training by adversely affecting policy gradients and reducing local policy entropy. To address this, the authors propose STAPO, a novel method that analyzes token-level policy gradients to mask gradient updates from spurious tokens and re-normalizes the loss over valid tokens. Unlike prior approaches relying on heuristic regularization, STAPO operates without such constraints and achieves an average improvement of 7.13% across six mathematical reasoning benchmarks on Qwen-1.7B, -8B, and -14B models, significantly outperforming GRPO, 20-Entropy, and JustRL while maintaining more stable entropy throughout training.
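The claimed negative correlation between a token's probability and its gradient magnitude follows directly from the softmax log-likelihood identity: the gradient of log p(sampled token) with respect to the sampled token's logit is 1 − p. A minimal numeric illustration of this (not the paper's code; the example logits are arbitrary):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logprob_grad_wrt_logit(probs, sampled_idx):
    # Softmax identity: d log p(sampled) / d logit_k = 1{k == sampled} - p_k.
    return [(1.0 if k == sampled_idx else 0.0) - p for k, p in enumerate(probs)]

# A rare (low-probability) sampled token receives a much larger gradient
# on its own logit than a confident one: the magnitude is exactly 1 - p.
probs_confident = softmax([4.0, 0.0, 0.0])    # p(sampled) ~ 0.96
probs_rare      = softmax([-4.0, 0.0, 0.0])   # p(sampled) ~ 0.009

g_conf = logprob_grad_wrt_logit(probs_confident, 0)[0]
g_rare = logprob_grad_wrt_logit(probs_rare, 0)[0]
print(g_conf, g_rare)
```

Since a sequence-level reward multiplies this per-token gradient unchanged, a rare token appearing in a correct response inherits the full reward scaled by a near-maximal gradient, which is the amplification mechanism the summary describes.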

📝 Abstract
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refinement, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy, and JustRL.
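The masking and renormalization step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the spurious-token criterion here is a hypothetical probability threshold standing in for the paper's gradient-based analysis, and all names are invented for the example.

```python
def masked_token_loss(token_logprobs, token_probs, advantage, prob_threshold=1e-4):
    """Sketch of masking + renormalization: zero out updates from tokens
    flagged as spurious (here via a hypothetical probability threshold)
    and average the policy-gradient surrogate over valid tokens only."""
    mask = [1.0 if p >= prob_threshold else 0.0 for p in token_probs]
    n_valid = sum(mask) or 1.0  # guard against an all-masked sequence
    # Token-level REINFORCE-style surrogate: -sum(mask * logprob * advantage),
    # renormalized by the number of surviving (valid) tokens.
    surrogate = sum(m * lp * advantage for m, lp in zip(mask, token_logprobs))
    return -surrogate / n_valid

# The middle token has vanishingly small probability, so it is masked out
# and the loss is averaged over the two remaining tokens.
loss = masked_token_loss(
    token_logprobs=[-0.1, -10.0, -0.2],
    token_probs=[0.9, 1e-6, 0.8],
    advantage=1.0,
)
print(loss)
```

Renormalizing by the count of valid tokens (rather than all tokens) keeps the loss scale comparable across sequences with different numbers of masked positions, which matches the abstract's description of averaging over valid tokens.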
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Training Instability
Spurious Tokens
Large Language Models
Policy Gradient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spurious Tokens
Policy Gradient Stability
Reinforcement Learning for LLMs
STAPO
Entropy Regularization
Shiqi Liu
School of Vehicle and Mobility & College of AI, Tsinghua University
Zeyu He
Ph.D. Student, Penn State University
Natural Language Processing · HCI · Crowdsourcing
Guojian Zhan
School of Vehicle and Mobility & College of AI, Tsinghua University
Letian Tao
School of Vehicle and Mobility & College of AI, Tsinghua University
Zhilong Zheng
School of Vehicle and Mobility & College of AI, Tsinghua University
Jiang Wu
School of Vehicle and Mobility & College of AI, Tsinghua University
Yinuo Wang
Tsinghua University
LLM · Reinforcement Learning · Autonomous Driving · Diffusion Model
Yang Guan
Software Engineer, Google Inc.
Networks
Kehua Sheng
DiDi Voyager Labs, DiDi Autonomous Driving
Bo Zhang
Meituan
MLLM · Model Compression · AutoML · Computer Vision
Keqiang Li
Department of Automotive Engineering, Tsinghua University
Intelligent Vehicles · Advanced Driver Assistant Systems
Jingliang Duan
University of Science and Technology Beijing
Shengbo Eben Li
School of Vehicle and Mobility & College of AI, Tsinghua University