🤖 AI Summary
This work investigates catastrophic forgetting induced by task adaptation during post-training of large language models (LLMs). Systematically comparing supervised fine-tuning (SFT) and reinforcement learning (RL) across Llama and Qwen models on instruction following, general knowledge, and arithmetic reasoning tasks, the authors find that RL forgets substantially less than SFT while achieving comparable or higher target-task performance. Analyzing a simplified setting in which the LM is modeled as a mixture of two distributions—one for prior knowledge, one for the target task—they trace RL's robustness to its mode-seeking behavior, which stems from training on on-policy data, rather than to other algorithmic choices such as KL regularization or advantage estimation. As a practical implication, the results suggest that approximately on-policy data, which can be far cheaper to obtain than fully on-policy data, offers an efficient pathway to low-forgetting continual learning in LLMs.
📝 Abstract
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
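The mode-seeking vs. mean-seeking distinction at the heart of the abstract can be illustrated with a toy numerical experiment (our own construction, not the paper's setup): fitting a single discretized Gaussian to a bimodal target distribution. Minimizing the forward KL divergence KL(p‖q), as cross-entropy training in SFT does, spreads the fit across both modes; minimizing the reverse KL divergence KL(q‖p), whose mode-seeking behavior mirrors on-policy RL objectives, locks onto a single mode. All names and parameter values below are illustrative.

```python
import numpy as np

# Discrete support for all distributions.
x = np.linspace(-6, 6, 601)

def normalize(w):
    return w / w.sum()

def gaussian(mu, sigma):
    """Discretized, normalized Gaussian on the grid x."""
    return normalize(np.exp(-0.5 * ((x - mu) / sigma) ** 2))

# Bimodal target: one mode stands in for "prior knowledge" (at -2),
# the other for the "target task" (at +2).
p = normalize(0.5 * gaussian(-2.0, 0.5) + 0.5 * gaussian(2.0, 0.5))

def kl(a, b):
    """KL divergence between two discrete distributions, with smoothing."""
    eps = 1e-12
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

def fit(objective):
    """Grid-search a single Gaussian q under the given KL objective."""
    best_loss, best_mu = np.inf, None
    for mu in np.linspace(-4.0, 4.0, 81):
        for sigma in np.linspace(0.3, 4.0, 38):
            q = gaussian(mu, sigma)
            loss = kl(p, q) if objective == "forward" else kl(q, p)
            if loss < best_loss:
                best_loss, best_mu = loss, mu
    return best_mu

mu_fwd = fit("forward")  # mean-seeking: settles between the two modes
mu_rev = fit("reverse")  # mode-seeking: concentrates on a single mode

print(f"forward KL fit: mu = {mu_fwd:.2f}")  # near 0 (covers both modes)
print(f"reverse KL fit: mu = {mu_rev:.2f}")  # near one mode (|mu| close to 2)
```

In the paper's framing, the reverse-KL fit leaving one mode untouched is the analogue of RL preserving prior knowledge while adapting to the target task, whereas the forward-KL fit smearing mass across both modes corresponds to SFT distorting the prior distribution.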