🤖 AI Summary
This work addresses distributional collapse in supervised fine-tuning (SFT) during post-training of large reasoning models, which exhausts the exploration space needed for effective reinforcement learning (RL). To bridge SFT and RL, the authors propose a unified post-training framework, Gibbs Initialization with Finite Temperature (GIFT). Viewing standard SFT as a degenerate zero-temperature limit, GIFT instead encodes supervision as a finite-temperature energy potential, constructing a distributional continuum between SFT and RL. Inspired by Gibbs distributions in statistical physics, this approach ensures objective consistency across training stages and theoretically supports an optimization path toward global optimality. Experiments demonstrate that GIFT significantly outperforms standard SFT and other baselines as an RL initialization across multiple benchmarks, with improved convergence stability.
📝 Abstract
The prevailing post-training paradigm for Large Reasoning Models (LRMs)--Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)--suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
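The "finite-temperature energy potential" construction described above can be sketched in standard Gibbs-distribution form. The notation below (base policy \(\pi_0\), supervision energy \(E\), temperature \(T\)) is an illustrative assumption, not necessarily the paper's exact formulation:

```latex
% Illustrative Gibbs-form target distribution (notation assumed):
% \pi_0 = base model's prior, E(x,y) = energy encoding the supervision
% signal (low energy on demonstrated solutions), T = temperature.
\pi_T(y \mid x) \;\propto\; \pi_0(y \mid x)\, \exp\!\bigl(-E(x,y)/T\bigr)

% Zero-temperature limit: the distribution collapses onto the energy
% minimizers, discarding the base prior \pi_0 -- the SFT-style
% distributional collapse the abstract describes:
\lim_{T \to 0}\; \pi_T(\cdot \mid x)
  \;=\; \text{point mass on } \arg\min_{y} E(x,y)

% Finite T > 0 retains mass from \pi_0 on non-minimizing responses,
% preserving the exploration space needed for subsequent RL.
```

Varying \(T\) thus traces a continuum from pure imitation (\(T \to 0\)) back toward the base distribution (\(T \to \infty\)), which is one way to read the "distributional bridge" between SFT and RL.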