Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

2026-05-13
Citations: 0
Influential: 0

AI Summary
This work addresses Training-Inference Mismatch (TIM), a subtle but critical discrepancy between the token-level probabilities computed during training and during inference (rollout generation) in large language model reinforcement learning. TIM can cause training collapse that is hard to diagnose. The authors propose VeXact, a diagnostic framework that gives precise control over numerical consistency and, for the first time, disentangles TIM from confounding factors such as off-policy drift and common stabilization mechanisms. Their analysis shows that TIM alone acts as a systems-level perturbation that can cause training failure on its own and changes the effective optimization objective. Experiments show that even small token-level numerical discrepancies are sufficient to trigger collapse, confirming that TIM is not benign. The framework also identifies a set of remedies that could mitigate the issue.
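One standard way to see why a mismatch can change the optimization objective is the importance-sampling identity for policy gradients. The sketch below uses notation assumed here rather than taken from the paper: $\pi_\theta$ is the distribution computed by the trainer, $\mu_\theta$ is the distribution the rollout engine actually samples from under the same weights, $A(y)$ is an advantage treated as independent of $\theta$, and $\delta_t$ is the per-token log-probability gap.

```latex
% Gradient of the intended objective, rewritten over rollout samples y ~ mu_theta:
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[ A(y)\, \nabla_\theta \log \pi_\theta(y) \right]
  = \mathbb{E}_{y \sim \mu_\theta}\!\left[ \frac{\pi_\theta(y)}{\mu_\theta(y)}\, A(y)\, \nabla_\theta \log \pi_\theta(y) \right]

% Update actually computed when the rollout is assumed to be exactly on-policy:
\hat{g}(\theta)
  = \mathbb{E}_{y \sim \mu_\theta}\!\left[ A(y)\, \nabla_\theta \log \pi_\theta(y) \right]

% The two agree only if the sequence-level ratio is 1. Per-token gaps accumulate:
\frac{\pi_\theta(y)}{\mu_\theta(y)}
  = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \mid y_{<t})}{\mu_\theta(y_t \mid y_{<t})}
  = \exp\!\Big( \sum_{t=1}^{T} \delta_t \Big),
\qquad
\delta_t = \log \pi_\theta(y_t \mid y_{<t}) - \log \mu_\theta(y_t \mid y_{<t})
```

Under these assumptions, $\hat{g}$ equals $\nabla_\theta J$ only when every $\delta_t$ is zero, and the sequence-level ratio depends on the sum of the gaps, so small per-token differences can add up over long generations.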
Abstract
Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
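To make the idea of token-level numerical disagreement concrete, here is a minimal sketch of how it could be measured. It is not VeXact and not the paper's setup: the two "stages" are simulated by computing log-softmax over the same random logits in float32 and in float16, and the helper names `token_logprobs` and `tim_report` are hypothetical.

```python
import numpy as np

def token_logprobs(logits, token_ids, dtype):
    """Log-probability of each chosen token, with log-softmax computed in `dtype`."""
    x = logits.astype(dtype)
    x = x - x.max(axis=-1, keepdims=True)            # stabilise before exponentiating
    logz = np.log(np.exp(x).sum(axis=-1, keepdims=True))
    logp = (x - logz).astype(np.float64)             # upcast only for comparison
    return np.take_along_axis(logp, token_ids[:, None], axis=-1)[:, 0]

def tim_report(logp_train, logp_rollout):
    """Summarise the gap between two sets of per-token log-probabilities."""
    diff = logp_train - logp_rollout
    return {
        "max_abs_token_diff": float(np.abs(diff).max()),
        "sequence_logratio": float(diff.sum()),
        "sequence_ratio": float(np.exp(diff.sum())),
    }

# Synthetic stand-in for one sampled sequence (assumed sizes, random logits).
rng = np.random.default_rng(0)
seq_len, vocab = 64, 1000
logits = rng.normal(scale=4.0, size=(seq_len, vocab)).astype(np.float32)
token_ids = rng.integers(0, vocab, size=seq_len)

# Same "weights" (logits), two numerical paths standing in for trainer and rollout.
logp_train = token_logprobs(logits, token_ids, np.float32)
logp_rollout = token_logprobs(logits, token_ids, np.float16)

report = tim_report(logp_train, logp_rollout)
print(report)
```

In a real system the two arrays would come from the training framework and the inference engine evaluating the same sequence with the same weights. A sequence ratio that departs from 1 there signals a mismatch between the two implementations.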
Problem

Research questions and friction points this paper is trying to address.

Training-Inference Mismatch
LLM Reinforcement Learning
rollout generation
policy optimization
numerical disagreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Inference Mismatch
LLM Reinforcement Learning
VeXact
numerical stability
policy optimization