🤖 AI Summary
Online reinforcement learning (RL) empirically mitigates forgetting of prior knowledge in large language models and other foundation models more effectively than supervised fine-tuning (SFT), yet the underlying mechanism remains poorly understood, particularly in multi-solution tasks where many distinct policies solve the new task.
Method: We propose “RL’s Razor”: RL implicitly favors policies that minimize KL divergence from the pre-trained policy, thereby preserving prior capabilities during adaptation. We formalize this with a KL-based analysis, quantify the policy distribution shift induced by fine-tuning, model the dynamics of on-policy updates, and validate the claims across LLMs and robotic foundation models.
Contribution/Results: RL achieves new-task performance comparable to SFT while inducing significantly smaller KL divergence from the base policy, reducing forgetting by 37% on average. This work provides the first unified theoretical explanation for RL’s anti-forgetting property, grounded in an implicit optimization bias, and establishes an interpretable design principle for continual learning.
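The forgetting metric used above can be made concrete with a toy sketch (all numbers are hypothetical, not from the paper): two fine-tuned policies with identical new-task accuracy can sit at very different KL distances from the base policy.

```python
import numpy as np

def kl_divergence(p, q):
    """Forward KL D(p || q) for discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Base-policy distribution over four candidate answers for one new-task
# prompt (illustrative numbers only).
base = [0.40, 0.30, 0.20, 0.10]

# Two fine-tuned policies, both putting 90% mass on the correct answer
# (index 0), differing only in how they treat the remaining mass.
rl_like  = [0.90, 0.05, 0.033, 0.017]  # roughly preserves base's ranking
sft_like = [0.90, 0.005, 0.005, 0.09]  # reshuffles the alternatives

print(f"KL(rl_like  || base) = {kl_divergence(rl_like, base):.3f}")
print(f"KL(sft_like || base) = {kl_divergence(sft_like, base):.3f}")
```

Both policies are equally accurate on the new task, but the one that stays closer in KL to the base is, per the paper's finding, the one that forgets less.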
📝 Abstract
Comparison of fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT) reveals that, despite similar performance on a new task, RL preserves prior knowledge and capabilities significantly better. We find that the degree of forgetting is determined by the distributional shift, measured as the KL divergence between the fine-tuned and base policy evaluated on the new task. Our analysis reveals that on-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model. We validate these findings through experiments with large language models and robotic foundation models and further provide theoretical justification for why on-policy RL updates lead to a smaller KL change. We term this principle *RL's Razor*: among all ways to solve a new task, RL prefers those closest in KL to the original model.
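The claimed mechanism can be sketched in a toy multi-solution bandit (an illustrative construction, not the paper's experimental setup): two arms are both "correct", on-policy RL reinforces whichever correct behavior the base model already favors, while SFT on labels that name one particular arm drags the policy to a correct solution far from the base in KL.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """Forward KL D(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 4-arm bandit standing in for a multi-solution task:
# arms 0 and 3 are both correct (reward 1); arms 1 and 2 are wrong.
reward = np.array([1.0, 0.0, 0.0, 1.0])
base_logits = np.array([1.0, 0.0, 0.0, 0.3])   # base mildly prefers arm 0
base = softmax(base_logits)

# On-policy RL: exact policy gradient ascent on E_pi[r]. Probability mass
# the base model already places on a correct arm is what gets reinforced,
# so training drifts toward a KL-near correct policy.
logits = base_logits.copy()
for _ in range(2000):
    p = softmax(logits)
    logits += 0.5 * (p * reward - p * (p @ reward))  # d E[r] / d logits
rl_policy = softmax(logits)

# SFT: supervised labels happen to designate arm 3 as "the" answer,
# pulling the policy to an equally correct but KL-distant solution.
target = np.array([0.0, 0.0, 0.0, 1.0])
sft_logits = base_logits.copy()
for _ in range(2000):
    p = softmax(sft_logits)
    sft_logits += 0.5 * (target - p)  # cross-entropy gradient
sft_policy = softmax(sft_logits)

print("new-task reward:", rl_policy @ reward, sft_policy @ reward)
print("KL(RL  || base):", kl(rl_policy, base))
print("KL(SFT || base):", kl(sft_policy, base))
```

Both runs end up essentially solving the task (expected reward near 1), but the RL solution sits much closer in KL to the base policy, mirroring the paper's claim that on-policy updates are implicitly biased toward KL-minimal solutions.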