On-Policy Replay for Continual Supervised Fine-Tuning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses catastrophic forgetting in large language models during continual supervised fine-tuning, where adapting to new tasks degrades performance on earlier ones. The authors propose On-Policy Replay (OPR), which requires no teacher model, auxiliary loss, or on-the-fly distillation. Instead, the most recent checkpoint generates responses to a small budget of historical prompts, the generations are filtered by a task-specific reward, and the surviving prompt-response pairs are replayed as ordinary examples in the standard supervised fine-tuning pipeline. The approach routes the on-policy signal through the training data rather than through a new training objective. The authors also report that low-score replay is uniformly worse than Vanilla Replay, which they present as evidence that the active ingredient is the on-policy distribution, not response quality alone. On the TRACE benchmark, with only a 1% replay budget, the method reduces the magnitude of backward transfer (|BWT|) by up to 46% relative to a tuned Vanilla Replay baseline.
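The summary leans on the BWT metric without defining it. As a rough illustration, the sketch below uses the standard continual-learning definition of backward transfer (the average change in score on earlier tasks after training on the final task); this is an assumption, since the exact variant used in the paper or in TRACE is not stated here. The score matrix is a made-up toy example, and `relative_abs_reduction` is a hypothetical helper showing how a "reduction in |BWT|" can be computed, applied only to figures quoted in the abstract.

```python
from typing import Sequence


def backward_transfer(scores: Sequence[Sequence[float]]) -> float:
    """Standard BWT (assumed definition, not confirmed by the source).

    scores[t][i] is the score on task i after training through task t.
    BWT averages (final score on task i) - (score right after learning task i)
    over every task except the last one. Negative values indicate forgetting.
    """
    num_tasks = len(scores)
    if num_tasks < 2:
        raise ValueError("BWT needs at least two tasks")
    last = num_tasks - 1
    deltas = [scores[last][i] - scores[i][i] for i in range(last)]
    return sum(deltas) / len(deltas)


def relative_abs_reduction(baseline_bwt: float, method_bwt: float) -> float:
    """Fractional reduction in |BWT| relative to a baseline (hypothetical helper)."""
    return (abs(baseline_bwt) - abs(method_bwt)) / abs(baseline_bwt)


# Toy 3-task score matrix (illustrative numbers, not from the paper).
toy_scores = [
    [80.0, 0.0, 0.0],
    [70.0, 90.0, 0.0],
    [60.0, 85.0, 75.0],
]
toy_bwt = backward_transfer(toy_scores)

# Figures quoted in the abstract for Qwen2.5-7B-Instruct:
# Sequential SFT BWT = -13.93, OPR at a 1% budget = -2.29.
vs_sequential = relative_abs_reduction(-13.93, -2.29)
```

Note that the 46% figure in the summary is measured against Vanilla Replay, whose BWT value is not given on this page, so the last line compares against Sequential SFT instead.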

📝 Abstract

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals (training on the model's own outputs) reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7-8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget, a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42-46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.
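The three steps the abstract describes (roll out, filter by reward, replay as ordinary SFT data) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `reward` are hypothetical callables standing in for the latest checkpoint and the task reward, the threshold-based filter is one possible filtering rule, and treating the budget as a fraction of the historical prompt pool is a guess, since the page does not say what the 1% or 10% is measured against.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, response)


def build_opr_replay(
    history_prompts: Sequence[str],
    generate: Callable[[str], str],       # hypothetical: rollout from the latest checkpoint
    reward: Callable[[str, str], float],  # hypothetical: task-specific reward
    budget_frac: float = 0.01,            # assumed to be a fraction of the historical prompts
    threshold: float = 1.0,               # assumed filter rule: keep reward >= threshold
    seed: int = 0,
) -> List[Example]:
    """Sample historical prompts, roll out the model, keep high-reward pairs."""
    if not history_prompts:
        return []
    rng = random.Random(seed)
    k = min(len(history_prompts), max(1, int(len(history_prompts) * budget_frac)))
    replay: List[Example] = []
    for prompt in rng.sample(list(history_prompts), k):
        response = generate(prompt)  # on-policy: the model's own output
        if reward(prompt, response) >= threshold:
            replay.append((prompt, response))
    return replay


def mix_for_sft(new_task: Sequence[Example], replay: Sequence[Example], seed: int = 0) -> List[Example]:
    """Merge replayed pairs into the new task's data; training is plain SFT on the result."""
    mixed = list(new_task) + list(replay)
    random.Random(seed).shuffle(mixed)
    return mixed


# Usage with stand-in callables (a real setup would call an LLM and a task metric).
history = [f"q{i}" for i in range(200)]
stub_generate = lambda p: p.upper()
stub_reward = lambda p, r: 1.0 if int(p[1:]) % 2 == 0 else 0.0
replay_pairs = build_opr_replay(history, stub_generate, stub_reward, budget_frac=0.1)
train_set = mix_for_sft([("new prompt", "gold answer")], replay_pairs)
```

Because the replayed pairs are ordinary (prompt, response) examples, nothing in the training loop changes: no teacher copy, no extra loss term, and no second forward pass.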

Problem

Research questions and friction points this paper is trying to address.

continual learning

catastrophic forgetting

supervised fine-tuning

large language models

on-policy learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Replay

Continual Supervised Fine-Tuning

Catastrophic Forgetting