🤖 AI Summary
Supervised fine-tuning (SFT) of large language models suffers from high-variance importance sampling and training instability caused by the distributional mismatch between the behavior policy (e.g., outputs from a teacher or initial model) and the target policy in off-policy learning. To address this, we propose a data rewriting framework that actively narrows the policy gap: correct responses are kept as on-policy data, while erroneous ones are regenerated via guided re-solving to better align with the target policy’s distribution, so the training data is harmonized before optimization begins. The method integrates off-policy learning, importance sampling, KL regularization, and a dynamic rewriting mechanism. Evaluated on five mathematical reasoning benchmarks, it significantly outperforms both standard SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. Results demonstrate substantial improvements in training stability, variance reduction, and generalization capability, validating both its effectiveness and conceptual novelty.
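To make the variance problem concrete, here is a minimal sketch of per-sequence importance weights with an upper clip. The function name, the clip value, and the toy log-probabilities are illustrative assumptions, not the paper's exact weighting scheme; the point is only that a large gap between target and behavior log-probabilities produces exploding raw weights, which clipping (or, as proposed here, shrinking the gap itself) tames.

```python
import math

def importance_weights(target_logps, behavior_logps, clip=5.0):
    # w = p_target / p_behavior per sequence, computed in log space.
    # The clip caps the extreme weights that arise from a large policy gap.
    # (Hypothetical helper; the paper's exact correction may differ.)
    return [min(math.exp(t - b), clip) for t, b in zip(target_logps, behavior_logps)]

# Small policy gap: weights stay near 1, so the IS estimator is low-variance.
near = importance_weights([-1.0, -1.1], [-1.05, -1.0])
# Large policy gap: the raw weight exp(7) ≈ 1096 would dominate the batch;
# clipping caps it, at the cost of bias.
far = importance_weights([-1.0], [-8.0])
print(near, far)
```

Data rewriting attacks the same problem from the other side: instead of clipping weights after the fact, it regenerates training responses so that `target_logps` and `behavior_logps` are close to begin with.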
📝 Abstract
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem, where expert demonstrations come from a fixed behavior policy while training aims to optimize a target policy. Importance sampling is the standard tool for correcting this distribution mismatch, but large policy gaps lead to high variance and training instability. Existing approaches mitigate this issue using KL penalties or clipping, which passively constrain updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap by keeping correct solutions as on-policy data and rewriting incorrect ones with guided re-solving, falling back to expert demonstrations only when needed. This aligns the training distribution with the target policy before optimization, reducing importance sampling variance and stabilizing off-policy fine-tuning. Experiments on five mathematical reasoning benchmarks demonstrate consistent and significant gains over both vanilla SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. The data and code will be released at https://github.com/NKU-HLT/Off-Policy-SFT.
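The three-tier curation described in the abstract can be sketched as follows. This is an illustrative outline, not the released implementation: `solve`, `guided_resolve`, and `verify` are hypothetical stand-ins for sampling from the target policy, re-solving with guidance (e.g., hints distilled from the expert answer), and checking answer correctness.

```python
def rewrite_dataset(problems, solve, guided_resolve, verify):
    """Sketch of the proposed data rewriting: keep correct on-policy
    solutions, rewrite failures via guided re-solving, and fall back to
    the expert demonstration only when re-solving also fails."""
    curated = []
    for question, expert_solution in problems:
        attempt = solve(question)
        if verify(question, attempt):
            curated.append((question, attempt))              # tier 1: on-policy, kept as-is
        else:
            retry = guided_resolve(question)
            if verify(question, retry):
                curated.append((question, retry))            # tier 2: guided rewrite
            else:
                curated.append((question, expert_solution))  # tier 3: expert fallback
    return curated

# Toy demo with stub functions standing in for model sampling / answer checking.
answers = {"2+2": "4", "3*3": "9", "7-5": "2"}
verify = lambda q, a: answers.get(q) == a
solve = lambda q: "4" if q == "2+2" else "wrong"    # model alone solves 2+2
guided = lambda q: "9" if q == "3*3" else "wrong"   # guidance rescues 3*3
data = rewrite_dataset(
    [("2+2", "expert-4"), ("3*3", "expert-9"), ("7-5", "expert-2")],
    solve, guided, verify)
```

In the toy run, `2+2` is kept on-policy, `3*3` is recovered by guided re-solving, and only `7-5` falls back to the expert demonstration, so most of the resulting training set is drawn from (or near) the target policy before optimization starts.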