🤖 AI Summary
This work addresses distributional collapse in supervised fine-tuning (SFT) during post-training of large reasoning models, which exhausts the exploration space needed for effective reinforcement learning (RL). To bridge SFT and RL, the authors propose a unified post-training framework, Gibbs Initialization with Finite Temperature (GIFT). Viewing standard SFT as a degenerate zero-temperature limit, GIFT instead encodes supervision as a finite-temperature energy potential, constructing a distributional continuum between SFT and RL. Inspired by Gibbs distributions in statistical physics, this approach ensures objective consistency across training stages and theoretically supports an optimization path toward global optimality. Experiments demonstrate that GIFT significantly outperforms standard SFT and other baselines as an RL initialization across multiple benchmarks, with improved convergence stability.
📝 Abstract
The prevailing post-training paradigm for Large Reasoning Models (LRMs)--Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)--suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
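The "finite-temperature energy potential" construction described above can be sketched in standard Gibbs-distribution form. The notation below (base policy \(\pi_0\), supervision energy \(E\), temperature \(T\)) is an illustrative assumption, not necessarily the paper's exact formulation:

```latex
% Illustrative Gibbs-form target distribution (notation assumed):
% \pi_0 = base model's prior, E(x,y) = energy encoding the supervision
% signal (low energy on demonstrated solutions), T = temperature.
\pi_T(y \mid x) \;\propto\; \pi_0(y \mid x)\, \exp\!\bigl(-E(x,y)/T\bigr)

% Zero-temperature limit: the distribution collapses onto the energy
% minimizers, discarding the base prior \pi_0 -- the SFT-style
% distributional collapse the abstract describes:
\lim_{T \to 0}\; \pi_T(\cdot \mid x)
  \;=\; \text{point mass on } \arg\min_{y} E(x,y)

% Finite T > 0 retains mass from \pi_0 on non-minimizing responses,
% preserving the exploration space needed for subsequent RL.
```

Varying \(T\) thus traces a continuum from pure imitation (\(T \to 0\)) back toward the base distribution (\(T \to \infty\)), which is one way to read the "distributional bridge" between SFT and RL.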