Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the instability in large language model training under reinforcement learning, which often arises from a mismatch between training and inference dynamics. The authors model this issue as a dynamic failure phenomenon and uncover its coupling mechanism with the optimization process. They propose using the length of generated responses as an early warning signal to dynamically trigger learning rate decay, thereby suppressing gradient noise and mitigating the train-inference mismatch. The resulting adaptive learning rate scheduling mechanism effectively stabilizes the training trajectory, confines the mismatch within a safe regime, and significantly enhances both the robustness and efficiency of reinforcement learning fine-tuning.

📝 Abstract
Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedules used by traditional LR schedulers, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
Problem

Research questions and friction points this paper is trying to address.

training-inference mismatch
reinforcement learning
large language models
optimization instability
gradient noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-inference mismatch
learning rate scheduling
reinforcement learning
gradient noise
optimization dynamics