On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard supervised fine-tuning (SFT) exhibits weaker generalization than reinforcement learning (RL) in large language models, primarily due to implicit reward structure bias: its fixed target objective induces token-level gradient imbalance. To address this, we propose Dynamic Fine-Tuning (DFT), a lightweight, differentiable method that dynamically rescales gradients via a probability-aware mechanism—introducing minimal, adaptive target adjustments directly into the loss function. Implemented with only a single-line code modification, DFT stabilizes per-token updates without architectural changes or additional inference overhead. Empirically, DFT consistently outperforms standard SFT across diverse general-purpose and reasoning benchmarks (e.g., MMLU, GSM8K, HumanEval). It enhances generalization across multiple base models—including Llama-3 and Qwen—and achieves performance on par with state-of-the-art offline RL methods on RL-specific tasks. The implementation is publicly available.

📝 Abstract
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
Problem

Research questions and friction points this paper is trying to address.

Improving generalization of Supervised Fine-Tuning in LLMs
Rectifying problematic reward structure in SFT gradients
Proposing Dynamic Fine-Tuning to stabilize gradient updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Fine-Tuning (DFT) improves SFT generalization
Rescales objective function with token probability
Stabilizes gradient updates for each token