🤖 AI Summary
Online reinforcement learning (RL) empirically mitigates forgetting of prior knowledge in large language models and other foundation models more effectively than supervised fine-tuning (SFT), yet the underlying mechanism remains poorly understood, particularly in multi-solution tasks where many distinct policies solve the new task.
Method: We propose “RL’s Razor”: RL implicitly favors policies that minimize KL divergence from the pre-trained policy, thereby preserving prior capabilities during adaptation. We formalize this with a KL-based analysis, quantify the policy distribution shift induced by fine-tuning, model the dynamics of on-policy updates, and validate the claims across LLMs and robotic foundation models.
Contribution/Results: RL achieves new-task performance comparable to SFT while inducing significantly smaller KL divergence from the base policy, reducing forgetting by 37% on average. This work provides the first unified theoretical explanation for RL’s anti-forgetting property, grounded in an implicit optimization bias, and establishes an interpretable design principle for continual learning.
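The forgetting metric used above can be made concrete with a toy sketch (all numbers are hypothetical, not from the paper): two fine-tuned policies with identical new-task accuracy can sit at very different KL distances from the base policy.

```python
import numpy as np

def kl_divergence(p, q):
    """Forward KL D(p || q) for discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Base-policy distribution over four candidate answers for one new-task
# prompt (illustrative numbers only).
base = [0.40, 0.30, 0.20, 0.10]

# Two fine-tuned policies, both putting 90% mass on the correct answer
# (index 0), differing only in how they treat the remaining mass.
rl_like  = [0.90, 0.05, 0.033, 0.017]  # roughly preserves base's ranking
sft_like = [0.90, 0.005, 0.005, 0.09]  # reshuffles the alternatives

print(f"KL(rl_like  || base) = {kl_divergence(rl_like, base):.3f}")
print(f"KL(sft_like || base) = {kl_divergence(sft_like, base):.3f}")
```

Both policies are equally accurate on the new task, but the one that stays closer in KL to the base is, per the paper's finding, the one that forgets less.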
📝 Abstract
Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance on a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle *RL's Razor*: among all ways to solve a new task, RL prefers those closest in KL to the original model.
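The claimed mechanism can be sketched in a toy multi-solution bandit (an illustrative construction, not the paper's experimental setup): two arms are both "correct", on-policy RL reinforces whichever correct behavior the base model already favors, while SFT on labels that name one particular arm drags the policy to a correct solution far from the base in KL.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """Forward KL D(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 4-arm bandit standing in for a multi-solution task:
# arms 0 and 3 are both correct (reward 1); arms 1 and 2 are wrong.
reward = np.array([1.0, 0.0, 0.0, 1.0])
base_logits = np.array([1.0, 0.0, 0.0, 0.3])   # base mildly prefers arm 0
base = softmax(base_logits)

# On-policy RL: exact policy gradient ascent on E_pi[r]. Probability mass
# the base model already places on a correct arm is what gets reinforced,
# so training drifts toward a KL-near correct policy.
logits = base_logits.copy()
for _ in range(2000):
    p = softmax(logits)
    logits += 0.5 * (p * reward - p * (p @ reward))  # d E[r] / d logits
rl_policy = softmax(logits)

# SFT: supervised labels happen to designate arm 3 as "the" answer,
# pulling the policy to an equally correct but KL-distant solution.
target = np.array([0.0, 0.0, 0.0, 1.0])
sft_logits = base_logits.copy()
for _ in range(2000):
    p = softmax(sft_logits)
    sft_logits += 0.5 * (target - p)  # cross-entropy gradient
sft_policy = softmax(sft_logits)

print("new-task reward:", rl_policy @ reward, sft_policy @ reward)
print("KL(RL  || base):", kl(rl_policy, base))
print("KL(SFT || base):", kl(sft_policy, base))
```

Both runs end up essentially solving the task (expected reward near 1), but the RL solution sits much closer in KL to the base policy, mirroring the paper's claim that on-policy updates are implicitly biased toward KL-minimal solutions.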