🤖 AI Summary
This work addresses a limitation of supervised fine-tuning (SFT) in sparse-reward settings, where standard SFT fails to match reinforcement learning (RL) performance. The authors observe that SFT on carefully curated datasets implicitly implements RL, and they propose importance-weighted SFT (iw-SFT), a theoretically grounded variant. Leveraging the connection between behavioral cloning and RL, iw-SFT optimizes an importance-weighted loss that yields a tighter lower bound on the RL objective and naturally accommodates quality-labeled data. On the AIME 2024 benchmark, iw-SFT achieves 66.7% accuracy, matching state-of-the-art RL methods and substantially outperforming standard SFT. The work characterizes SFT as an implicit RL procedure and introduces a lightweight, principled alternative that is applicable to both large language models and continuous-control tasks.
📝 Abstract
Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse-reward setting, which helps explain its often-observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance-weighted variant that behaves closer to training with RL, as it: i) optimizes a tighter bound on the RL objective and ii) can improve performance compared to SFT on curated data. We refer to this variant as importance-weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality-scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.
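The importance-weighted loss described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the sequence-level weighting scheme, and the choice to treat the importance weight as a constant (no gradient through it) are all assumptions made for the sketch. The idea is to reweight each curated sequence's log-likelihood by the ratio of its probability under the current policy to its probability under the reference (data-collection) policy, so the objective tightens the SFT lower bound on the RL return while reducing exactly to standard SFT when the two policies coincide.

```python
import math

def iw_sft_loss(logp_theta, logp_ref):
    """Sketch of an importance-weighted SFT loss over a batch of sequences.

    logp_theta: per-sequence log-likelihoods under the current policy.
    logp_ref:   per-sequence log-likelihoods under the reference policy
                (e.g. the SFT initialization that produced the data).

    The weight w = pi_theta / pi_ref is computed per sequence and treated
    as a constant (in an autodiff framework it would be detached), so when
    both policies agree (all weights equal 1) this reduces to the ordinary
    negative-log-likelihood loss of standard SFT.
    """
    losses = []
    for lt, lr in zip(logp_theta, logp_ref):
        w = math.exp(lt - lr)       # sequence-level importance weight
        losses.append(-w * lt)      # weighted negative log-likelihood
    return sum(losses) / len(losses)
```

When `logp_theta == logp_ref`, every weight is 1 and the loss equals the mean NLL of plain SFT; as the current policy puts more mass on a sequence than the reference did, that sequence's gradient contribution is amplified accordingly.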