Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates why supervised fine-tuning (SFT) causes more catastrophic forgetting in large language models than reinforcement learning (RL) does. It links forgetting to the preservation of internal computational circuits and proposes a metric, “differential circuit vulnerability”, that quantifies at the attention-head level how much a circuit degrades under each fine-tuning strategy. Experiments on Qwen2.5-3B-Instruct fine-tuned for scientific question answering show a trade-off: SFT adapts to the target task faster but substantially disrupts pre-trained circuits and forgets more prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower adaptation. The results suggest that circuit preservation may help explain RL’s greater robustness to catastrophic forgetting.

📝 Abstract

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. Our code is available at https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.
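
The abstract does not spell out how differential circuit vulnerability is computed, so the following is only an illustrative sketch of what a head-level comparison could look like, not the authors' definition. It assumes each attention head already has a scalar importance score (for example from an ablation or attribution method), treats the top-k heads of the base model as the "circuit", and compares how much of that set survives under SFT versus RL. The function names, the top-k rule, and the toy scores are all hypothetical.

```python
from typing import Sequence, Set


def top_k_heads(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring heads (a hypothetical stand-in for a circuit)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def circuit_retention(base: Sequence[float], tuned: Sequence[float], k: int) -> float:
    """Fraction of the base model's top-k heads that are still top-k after fine-tuning."""
    base_circuit = top_k_heads(base, k)
    tuned_circuit = top_k_heads(tuned, k)
    return len(base_circuit & tuned_circuit) / k


def vulnerability_gap(
    base: Sequence[float], sft: Sequence[float], rl: Sequence[float], k: int
) -> float:
    """Circuit loss under SFT minus circuit loss under RL.

    Positive values mean SFT disrupted the base circuit more than RL did.
    This is an assumed formulation, not the paper's metric.
    """
    sft_loss = 1.0 - circuit_retention(base, sft, k)
    rl_loss = 1.0 - circuit_retention(base, rl, k)
    return sft_loss - rl_loss


# Toy, invented per-head importance scores for six heads (not real measurements).
base_scores = [0.90, 0.80, 0.10, 0.70, 0.20, 0.05]
sft_scores = [0.20, 0.90, 0.80, 0.10, 0.70, 0.05]
rl_scores = [0.85, 0.80, 0.30, 0.60, 0.20, 0.05]

gap = vulnerability_gap(base_scores, sft_scores, rl_scores, k=3)
```

A set-overlap measure like this is easy to read but ignores how much individual head scores shift; a real analysis would more likely work with the continuous scores and the paper's own circuit-discovery procedure.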

Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting

reinforcement learning

supervised fine-tuning

computational circuits

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

catastrophic forgetting

reinforcement learning

supervised fine-tuning