Defeating the Training-Inference Mismatch via FP16

πŸ“… 2025-10-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Reinforcement learning (RL) fine-tuning of large language models (LLMs) is frequently destabilized by a numerical mismatch between the training and inference policies: even when both sides run the same nominal precision, the widely adopted BF16 format introduces rounding errors large enough to break their consistency. The paper identifies BF16 rounding error as the root cause and proposes a lightweight fix: adopt FP16 uniformly across the entire training and inference pipeline. The change requires no architectural modifications, algorithmic corrections, or hyperparameter tuning; it relies only on the native FP16 support already available in modern deep learning frameworks. Experiments across diverse tasks, algorithms, and frameworks show more stable optimization, faster convergence, and stronger final policy performance, making the approach an essentially plug-and-play improvement.
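The paper reports that only a few lines need to change; as a rough sketch of what "FP16 everywhere" can look like in a PyTorch loop (the toy model, shapes, and hyperparameters here are illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda"  # this sketch assumes a CUDA device for FP16 autocast
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")  # loss scaling guards FP16's narrow dynamic range

x = torch.randn(8, 16, device=device)
target = torch.randint(0, 4, (8,), device=device)

# Training step in float16 (not bfloat16): forward and backward run in FP16,
# while master weights and optimizer state remain FP32.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = F.cross_entropy(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Inference under the same FP16 numerics, so the sampling policy and the
# trained policy see identical rounding behavior.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(x)
```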

πŸ“ Abstract
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating-point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to **FP16** effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
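The abstract's distinction between dynamic range and rounding error is easy to verify independently: BF16 keeps about 8 significand bits where FP16 keeps about 11, so for values inside FP16's range BF16 rounds roughly 8x more coarsely. A quick check (ours, not the paper's):

```python
import torch

# Round-trip values well inside FP16's representable range through each
# 16-bit format and measure the rounding error introduced by the cast.
x = torch.rand(1_000_000, dtype=torch.float64)
bf16_err = (x - x.to(torch.bfloat16).to(torch.float64)).abs().mean()
fp16_err = (x - x.to(torch.float16).to(torch.float64)).abs().mean()
print(f"mean |rounding error|  BF16: {bf16_err:.2e}   FP16: {fp16_err:.2e}")
# Expect BF16's error to be ~8x FP16's (unit roundoff 2^-8 vs 2^-11).
```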
Problem

Research questions and friction points this paper is trying to address.

Addressing numerical instability in RL fine-tuning of LLMs
Diagnosing BF16 rounding error as the root cause of the training-inference mismatch
Improving stability and performance across diverse RL tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses FP16 uniformly to eliminate the training-inference mismatch
Requires only a few lines of code, with no architecture or algorithm changes (see the sketch below)
Improves stability, convergence, and final performance across tasks, algorithms, and frameworks
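On the inference side, a hedged example: if the rollout engine happens to be vLLM (the paper's exact stack is not pinned down here), matching the precision is a one-argument change; the checkpoint name below is just a placeholder.

```python
from vllm import LLM, SamplingParams

# dtype="float16" (rather than "bfloat16"/"auto") aligns rollout numerics
# with an FP16 trainer; the model name is illustrative only.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", dtype="float16")
outputs = llm.generate(["The training-inference mismatch is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```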