🤖 AI Summary
In LLM-based reinforcement learning, numerical precision discrepancies between training and inference produce an asymmetric mismatch in token probability distributions: the log-probability error is bounded above by (1 − p), where p is the token probability, so the bias is worst for low-probability tail tokens and accumulates systematically along sequences, severely destabilizing gradient estimation. This work is the first to formally identify and characterize this asymmetric mismatch mechanism. We propose a probability-aware dynamic vocabulary pruning method that constructs a "safe vocabulary" by proactively excluding tail tokens at high risk of mismatch; it integrates sequence-level mismatch modeling with stable RL objective constraints. We theoretically prove that the optimization bias introduced by pruning is bounded. Experiments demonstrate that our approach ensures stable convergence throughout training and significantly improves policy-gradient reliability.
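The asymmetry of the (1 − p) bound and its accumulation over a sequence can be illustrated with a small synthetic example. The probability ranges below are invented for illustration and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: accumulate the per-token bound (1 - p) over a
# 256-token sequence for two invented regimes of sampled-token probability.
T = 256
p_head = rng.uniform(0.90, 0.999, size=T)  # mostly high-probability tokens
p_tail = rng.uniform(0.001, 0.10, size=T)  # mostly low-probability tail tokens

bound_head = np.sum(1 - p_head)  # small: per-token bounds nearly vanish
bound_tail = np.sum(1 - p_tail)  # large: mismatch bound grows with length
print(bound_head, bound_tail)
```

Under this toy setup the head-heavy sequence accumulates a mismatch bound orders of magnitude smaller than the tail-heavy one, matching the claim that tail tokens dominate sequence-level mismatch.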
📄 Abstract
Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$, where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large; moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically pruned "safe" vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
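A minimal sketch of the safe-vocabulary idea, assuming a simple per-token probability threshold `delta` as the pruning criterion (the paper's actual dynamic criterion may differ; `safe_vocab_logprobs` is a hypothetical helper name):

```python
import numpy as np

def safe_vocab_logprobs(logits, delta=1e-3):
    """Hypothetical sketch: restrict the policy to a 'safe' vocabulary of
    tokens whose probability exceeds delta, then renormalize over it.
    Excluded tail tokens get log-probability -inf (never sampled)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    safe = probs > delta                 # probability-aware mask over the vocab
    pruned = np.where(safe, probs, 0.0)
    pruned /= pruned.sum()               # renormalize over the safe vocabulary
    logp = np.full_like(probs, -np.inf)
    logp[safe] = np.log(pruned[safe])
    return logp, safe

logits = np.array([5.0, 4.0, 0.0, -5.0])
logp, safe = safe_vocab_logprobs(logits, delta=1e-3)
```

Sampling and the RL objective are then computed only over the safe set, trading the large, biased tail mismatches for the small renormalization bias that the paper bounds.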