🤖 AI Summary
This work investigates the mechanisms underlying catastrophic forgetting in large language models during supervised fine-tuning (SFT) and the relative robustness of reinforcement learning (RL) in preserving prior capabilities. The study links forgetting to the preservation of internal computational circuits and proposes a metric, “differential circuit vulnerability”, that quantifies at the attention-head level how much a circuit degrades under a given fine-tuning strategy. Experiments on Qwen2.5-3B-Instruct adapted to scientific question-answering show that SFT adapts to the target task faster but disrupts pre-trained circuits substantially more and forgets more prior capabilities. RL, in contrast, preserves a larger fraction of the base circuit at the cost of slower adaptation, suggesting that circuit preservation may help explain RL’s greater robustness to catastrophic forgetting.
📝 Abstract
Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. Our code is available at https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.