Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the instability in large language model training under reinforcement learning, which often arises from a mismatch between training and inference dynamics. The authors model this issue as a dynamic failure phenomenon and uncover its coupling mechanism with the optimization process. They propose using the length of generated responses as an early warning signal to dynamically trigger learning rate decay, thereby suppressing gradient noise and mitigating the train-inference mismatch. The resulting adaptive learning rate scheduling mechanism effectively stabilizes the training trajectory, confines the mismatch within a safe regime, and significantly enhances both the robustness and efficiency of reinforcement learning fine-tuning.

📝 Abstract
Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedules used by traditional LR schedulers, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
Problem

Research questions and friction points this paper is trying to address.

training-inference mismatch
reinforcement learning
large language models
optimization instability
gradient noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-inference mismatch
learning rate scheduling
reinforcement learning
gradient noise
optimization dynamics