TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets two limitations of LLM post-training: supervised fine-tuning (SFT) often suffers from catastrophic forgetting caused by the mismatch between static labels and the evolving model policy, while reinforcement learning (RL) is costly and unstable. The authors propose Trajectory-Mixed Supervision (TMS), a framework that introduces a reward-free, dynamic supervision mechanism. TMS constructs an online curriculum from the model's own historical checkpoints and uses a Policy-Label Divergence (PLD) metric to guide fine-tuning, mitigating the mode collapse that drives forgetting and preserving pre-trained capabilities. Experiments show that TMS significantly outperforms both standard and iterative SFT on benchmarks such as MATH and GSM8K, approaching the performance of RL-based methods, and that PLD accurately predicts the degree of forgetting during training.
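Neither the summary nor the abstract gives a formula for PLD. A minimal sketch under one plausible reading (the mean negative log-probability the current policy assigns to the static label tokens, so that rising PLD means the labels have drifted off-policy); the function name, shapes, and toy logits are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def policy_label_divergence(policy_logits, label_ids):
    """Hypothetical PLD: mean negative log-probability that the
    current policy assigns to the static label tokens. Rising PLD
    would signal that the supervision labels are drifting off-policy."""
    logits = np.asarray(policy_logits, dtype=float)
    # log-softmax over the vocabulary axis, stabilized by max-subtraction
    logits -= logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(label_ids)), label_ids]
    return token_nll.mean()

labels = [0, 1]
# A policy concentrated on the label tokens yields low PLD...
confident = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
low = policy_label_divergence(confident, labels)
# ...while a policy that has drifted away from them yields high PLD.
drifted = np.array([[0.0, 0.0, 5.0], [5.0, 0.0, 0.0]])
high = policy_label_divergence(drifted, labels)
assert low < high
```

Under this reading, PLD reduces to the token-level cross-entropy between the policy and the one-hot labels, which is consistent with the claim that its drift tracks forgetting.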

📝 Abstract
Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to $\textbf{Supervision Mismatch}$: the divergence between the model's evolving policy and static training labels. We address this trade-off with $\textbf{Trajectory-Mixed Supervision (TMS)}$, a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes $\textit{Policy-Label Divergence (PLD)}$, preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy--retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting and that TMS successfully mitigates this drift.
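The abstract describes TMS as building a dynamic curriculum from historical checkpoints so that supervision stays close to the current policy (low PLD). A toy sketch of one way such trajectory mixing could be scheduled; the softmax sampling rule, source names, and PLD values are our assumptions for illustration, not the paper's algorithm:

```python
import math
import random

def select_supervision(sources, pld_scores, temperature=1.0):
    """Toy trajectory-mixed sampler: pick the supervision source for a
    batch with probability decreasing in its Policy-Label Divergence,
    so training favors targets closest to the current policy."""
    weights = [math.exp(-s / temperature) for s in pld_scores]
    total = sum(weights)
    return random.choices(sources, weights=[w / total for w in weights])[0]

random.seed(0)
# Hypothetical scenario: the static gold labels have drifted far
# off-policy (high PLD); a recent checkpoint's outputs are near
# on-policy (low PLD), so they dominate the mixture.
sources = ["gold_labels", "checkpoint_t-1", "checkpoint_t-3"]
plds = [4.0, 0.5, 1.5]
picks = [select_supervision(sources, plds) for _ in range(1000)]
assert picks.count("checkpoint_t-1") > picks.count("gold_labels")
```

The design intuition matches the abstract: by reweighting supervision toward low-PLD sources drawn from the model's own trajectory, the curriculum approximates on-policy training without a reward model or verifier.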
Problem

Research questions and friction points this paper is trying to address.

Supervision Mismatch
Catastrophic Forgetting
Policy-Label Divergence
Reward-Free Learning
Retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory-Mixed Supervision
Policy-Label Divergence
Supervision Mismatch
Reward-Free SFT
On-Policy Learning