🤖 AI Summary
This work addresses a limitation of supervised fine-tuning (SFT) in sparse-reward settings, where standard SFT fails to match reinforcement learning (RL) performance. The authors observe that SFT on carefully curated datasets implicitly implements RL, and they propose importance-weighted SFT (iw-SFT), a theoretically grounded variant. Leveraging the connection between behavioral cloning and RL, iw-SFT optimizes an importance-weighted loss that yields a tighter lower bound on the RL objective and naturally accommodates quality-labeled data. On the AIME 2024 benchmark, iw-SFT achieves 66.7% accuracy, matching state-of-the-art RL methods and substantially outperforming standard SFT. The work characterizes SFT as an implicit RL procedure and introduces a lightweight, principled alternative that is applicable to both large language models and continuous-control tasks.
📝 Abstract
Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse-reward setting, which helps explain its often-observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance-weighted variant that behaves closer to training with RL, as it: i) optimizes a tighter bound on the RL objective and ii) can improve performance compared to SFT on curated data. We refer to this variant as importance-weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality-scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.
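The importance-weighted loss described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the sequence-level weighting scheme, and the choice to treat the importance weight as a constant (no gradient through it) are all assumptions made for the sketch. The idea is to reweight each curated sequence's log-likelihood by the ratio of its probability under the current policy to its probability under the reference (data-collection) policy, so the objective tightens the SFT lower bound on the RL return while reducing exactly to standard SFT when the two policies coincide.

```python
import math

def iw_sft_loss(logp_theta, logp_ref):
    """Sketch of an importance-weighted SFT loss over a batch of sequences.

    logp_theta: per-sequence log-likelihoods under the current policy.
    logp_ref:   per-sequence log-likelihoods under the reference policy
                (e.g. the SFT initialization that produced the data).

    The weight w = pi_theta / pi_ref is computed per sequence and treated
    as a constant (in an autodiff framework it would be detached), so when
    both policies agree (all weights equal 1) this reduces to the ordinary
    negative-log-likelihood loss of standard SFT.
    """
    losses = []
    for lt, lr in zip(logp_theta, logp_ref):
        w = math.exp(lt - lr)       # sequence-level importance weight
        losses.append(-w * lt)      # weighted negative log-likelihood
    return sum(losses) / len(losses)
```

When `logp_theta == logp_ref`, every weight is 1 and the loss equals the mean NLL of plain SFT; as the current policy puts more mass on a sequence than the reference did, that sequence's gradient contribution is amplified accordingly.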