GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

📅 2026-04-15

🤖 AI Summary
This work addresses key limitations of supervised fine-tuning (SFT) in post-training large language models: single-path dependency, entropy collapse, and gradient explosion, which hinder both efficient knowledge injection and robust generalization. The authors' training-dynamics analysis shows that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting. Motivated by this diagnosis, they propose Group Fine-Tuning (GFT), a unified post-training framework built on two mechanisms. Group Advantage Learning constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity. Dynamic Coefficient Rectification adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments show that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent reinforcement learning stages.

📝 Abstract
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
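
The policy-gradient reading of SFT described in the abstract can be made concrete with a standard identity. The following is a sequence-level sketch, not the paper's own derivation: the symbols $\pi_\theta$, $y^*$, $r$, and $w$ are notation assumed here, and the paper may work at the token level or use different notation.

```latex
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  &= -\,\mathbb{E}_{(x, y^*) \sim \mathcal{D}}\big[\nabla_\theta \log \pi_\theta(y^* \mid x)\big], \\
\nabla_\theta \log \pi_\theta(y^* \mid x)
  &= \sum_{y} \pi_\theta(y \mid x)\,
     \frac{\mathbb{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
     \nabla_\theta \log \pi_\theta(y \mid x) \\
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
     \big[\, w(y)\, r(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big],
\end{aligned}
\qquad
r(y) = \mathbb{1}[y = y^*], \quad w(y) = \frac{1}{\pi_\theta(y \mid x)}.
```

Under this reading the reward $r$ is nonzero only on the single reference response, and the weight $1/\pi_\theta(y^* \mid x)$ grows without bound as the reference response becomes unlikely under the current policy, which is consistent with the sparsity and instability the abstract describes.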
Problem

Research questions and friction points this paper is trying to address.

supervised fine-tuning
reward sparsity
inverse-probability weighting
entropy collapse
gradient explosion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Fine-Tuning
Group Advantage Learning
Dynamic Coefficient Rectification
Reward Sparsity
Policy Gradient Optimization
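
The two mechanisms named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `group_advantages` and `rectified_weight`, the mean/std normalization, and the fixed cap `max_weight` are all assumptions chosen for illustration. In particular, the paper describes an adaptive bound, whereas the sketch uses a constant one.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one response group (assumed mean/std form).

    Each response is scored relative to the other responses sampled for the
    same prompt, so the signal is contrastive rather than a single 0/1 reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def rectified_weight(prob, max_weight=5.0):
    """Bound the inverse-probability weight 1/p from above.

    Hypothetical simplification: a constant cap stands in for the adaptive
    bound described in the abstract.
    """
    prob = np.asarray(prob, dtype=float)
    return np.minimum(1.0 / prob, max_weight)


if __name__ == "__main__":
    # Four sampled responses to one prompt, two judged correct.
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A likely response and a very unlikely one; the second weight is capped.
    print(rectified_weight([0.5, 0.01]))
```

With a group in which every response receives the same reward, the normalized advantages are all zero, so such a group contributes no gradient in this sketch. The cap keeps a low-probability response from producing an arbitrarily large weight.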