Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

📅 2025-12-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In LLM-based reinforcement learning, numerical precision discrepancies between training and inference systems cause an asymmetric mismatch in token probability distributions: the log-probability error is bounded above by (1−p), so the bias is worst for low-probability tail tokens and accumulates systematically along sequences, severely degrading the stability of gradient estimation. This work is the first to formally identify and characterize this asymmetric mismatch mechanism. The authors propose a probability-aware dynamic vocabulary pruning method that constructs a "safe vocabulary" by proactively excluding tail tokens at high risk of mismatch; it integrates sequence-level mismatch modeling with stability constraints on the RL objective. They prove that the optimization bias introduced by pruning is bounded. Experiments demonstrate that the approach converges stably throughout training and significantly improves the reliability of policy gradients.

๐Ÿ“ Abstract
Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$, where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large; moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically pruned "safe" vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
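The $(1-p)$ scaling can be seen directly from the softmax: since $\partial \log p_i / \partial z_i = 1 - p_i$, a small perturbation $\varepsilon$ to token $i$'s logit shifts its log-probability by roughly $\varepsilon(1-p_i)$, which vanishes for head tokens and stays near $\varepsilon$ for tail tokens. A minimal illustrative simulation (the toy logits and the size of `eps`, a stand-in for a precision-induced logit error, are assumptions, not values from the paper):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

logits = [4.0, 2.0, 0.0, -2.0, -4.0]  # head token first, tail token last
logp = log_softmax(logits)
probs = [math.exp(x) for x in logp]

eps = 1e-3  # stand-in for a small precision-induced error on one logit
for i, p in enumerate(probs):
    bumped = list(logits)
    bumped[i] += eps
    err = abs(log_softmax(bumped)[i] - logp[i])
    # To first order, err ~= eps * (1 - p): negligible for the head
    # token, close to the full eps for tail tokens.
    print(f"p={p:.4f}  |d log p|={err:.2e}  eps*(1-p)={eps * (1 - p):.2e}")
```

The printed errors track `eps * (1 - p)` closely, matching the asymmetry the abstract describes: the same-sized logit error is nearly invisible on high-probability tokens but transfers almost fully to tail-token log-probabilities.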
Problem

Research questions and friction points this paper is trying to address.

Addresses training-inference mismatch in LLM reinforcement learning
Mitigates destabilizing effects of low-probability token sampling
Proposes dynamic vocabulary pruning for stable gradient estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic vocabulary pruning excludes low-probability tail tokens
Constrains RL objective to a safe vocabulary for stability
Trades biased mismatches for small bounded optimization bias
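The pruning mechanism in the bullets above can be sketched as follows. This is a hypothetical, simplified reading of the method: `p_min` is an assumed fixed threshold, whereas the paper's "dynamic" rule is likely probability-aware in a more sophisticated way; renormalizing over the kept tokens is one natural way to constrain the RL objective to the safe vocabulary.

```python
import math

def safe_vocab_mask(logits, p_min=1e-4):
    """Keep only tokens whose probability exceeds p_min.

    p_min is an assumed hyperparameter; the paper's actual pruning
    criterion (a dynamic, mismatch-risk-based rule) may differ.
    """
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    probs = [math.exp(z - lse) for z in logits]
    return [p >= p_min for p in probs]

def pruned_log_probs(logits, mask):
    """Renormalize over the safe vocabulary; pruned tokens get -inf."""
    kept = [z for z, keep in zip(logits, mask) if keep]
    m = max(kept)
    lse = m + math.log(sum(math.exp(z - m) for z in kept))
    return [z - lse if keep else float("-inf")
            for z, keep in zip(logits, mask)]

# Example: the extreme-tail token (logit -9.0) falls below p_min
# and is excluded; the RL objective would use only the kept tokens.
logits = [5.0, 3.0, 0.0, -9.0]
mask = safe_vocab_mask(logits)
log_probs = pruned_log_probs(logits, mask)
```

Because the excluded mass is tiny by construction, the renormalized distribution stays close to the original one, which is the intuition behind the bounded optimization bias the summary claims.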