🤖 AI Summary
In LLM-based reinforcement learning, numerical precision discrepancies between training and inference produce an asymmetric mismatch in token probability distributions: the log-probability error is bounded above by (1 − p), where p is the token probability, so the bias is worst for low-probability tail tokens and accumulates systematically along sequences, severely destabilizing gradient estimation. This work is the first to formally identify and characterize this asymmetric mismatch mechanism. We propose a probability-aware dynamic vocabulary pruning method that constructs a "safe vocabulary" by proactively excluding tail tokens at high risk of mismatch; it integrates sequence-level mismatch modeling with stable RL objective constraints. We theoretically prove that the optimization bias introduced by pruning is bounded. Experiments demonstrate that our approach ensures stable convergence throughout training and significantly improves policy-gradient reliability.
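The asymmetry of the (1 − p) bound and its accumulation over a sequence can be illustrated with a small synthetic example. The probability ranges below are invented for illustration and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: accumulate the per-token bound (1 - p) over a
# 256-token sequence for two invented regimes of sampled-token probability.
T = 256
p_head = rng.uniform(0.90, 0.999, size=T)  # mostly high-probability tokens
p_tail = rng.uniform(0.001, 0.10, size=T)  # mostly low-probability tail tokens

bound_head = np.sum(1 - p_head)  # small: per-token bounds nearly vanish
bound_tail = np.sum(1 - p_tail)  # large: mismatch bound grows with length
print(bound_head, bound_tail)
```

Under this toy setup the head-heavy sequence accumulates a mismatch bound orders of magnitude smaller than the tail-heavy one, matching the claim that tail tokens dominate sequence-level mismatch.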
📄 Abstract
Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$, where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large; moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically pruned "safe" vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
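A minimal sketch of the safe-vocabulary idea, assuming a simple per-token probability threshold `delta` as the pruning criterion (the paper's actual dynamic criterion may differ; `safe_vocab_logprobs` is a hypothetical helper name):

```python
import numpy as np

def safe_vocab_logprobs(logits, delta=1e-3):
    """Hypothetical sketch: restrict the policy to a 'safe' vocabulary of
    tokens whose probability exceeds delta, then renormalize over it.
    Excluded tail tokens get log-probability -inf (never sampled)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    safe = probs > delta                 # probability-aware mask over the vocab
    pruned = np.where(safe, probs, 0.0)
    pruned /= pruned.sum()               # renormalize over the safe vocabulary
    logp = np.full_like(probs, -np.inf)
    logp[safe] = np.log(pruned[safe])
    return logp, safe

logits = np.array([5.0, 4.0, 0.0, -5.0])
logp, safe = safe_vocab_logprobs(logits, delta=1e-3)
```

Sampling and the RL objective are then computed only over the safe set, trading the large, biased tail mismatches for the small renormalization bias that the paper bounds.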