Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates catastrophic forgetting in continual post-training (CPT), specifically comparing supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). We find that RFT inherently preserves knowledge, maintaining or even enhancing general capabilities (e.g., MMMU, MMLU-Pro) across multi-task continual learning. We attribute this advantage to implicit KL regularization emerging during policy optimization and propose a rollout-based instance filtering algorithm to improve RFT’s training stability and efficiency. To enable systematic evaluation, we introduce the first CPT benchmark tailored for multimodal tasks, integrating chain-of-thought reasoning and KL divergence analysis. Experiments on a seven-stage sequential task setup demonstrate that our RFT method matches the performance of full multi-task learning—without requiring memory replay or parameter isolation mechanisms.

📝 Abstract
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT severely degrades general model capabilities. Further analysis shows that explicit mechanisms, such as the KL penalty and chain-of-thought reasoning, are not the primary factors. Instead, we find that the implicit regularization inherent to RFT is a key factor in mitigating forgetting. Finally, we propose a rollout-based instance filtering algorithm to improve the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
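The abstract attributes RFT's knowledge retention to implicit KL regularization toward the pre-update policy. The paper's exact estimator is not given on this page; the sketch below shows a common per-token KL estimator (the non-negative "k3" form, exp(r) - r - 1) used in many RFT pipelines to measure drift between the current policy and a frozen reference model. The function names are illustrative, not the paper's API.

```python
import math

def token_kl(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate between policy and reference model.

    Uses the non-negative 'k3' estimator exp(r) - r - 1 with
    r = logp_ref - logp_policy. It is zero exactly when the policy
    assigns the same log-probability as the reference, and grows as
    the policy drifts away -- the quantity an RFT KL penalty tracks.
    """
    r = logp_ref - logp_policy
    return math.exp(r) - r - 1.0

def sequence_kl(policy_logps: list[float], ref_logps: list[float]) -> float:
    """Mean per-token KL over one generated sequence."""
    assert len(policy_logps) == len(ref_logps)
    return sum(token_kl(p, q) for p, q in zip(policy_logps, ref_logps)) / len(policy_logps)
```

Whether the penalty is applied explicitly or emerges implicitly (as the paper argues for RFT), a drift measure of this kind is what the KL divergence analysis quantifies.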
Problem

Research questions and friction points this paper is trying to address.

Compares SFT and RFT impacts on knowledge retention in continual post-training
Investigates catastrophic forgetting in SFT versus knowledge preservation in RFT
Explores implicit regularization in RFT as key to mitigating forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement fine-tuning mitigates catastrophic forgetting
Implicit regularization in RFT preserves prior knowledge
Rollout-based filtering enhances RFT stability and efficiency
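The page does not spell out the rollout-based instance filtering rule. A minimal sketch, under the common assumption (as in group-relative policy optimization) that prompts whose rollouts all receive the same reward yield zero advantage and only add noise: sample several rollouts per prompt and keep only prompts with non-degenerate reward groups. `rollout_fn` and `reward_fn` are hypothetical interfaces, not the paper's actual implementation.

```python
def filter_instances(batch, reward_fn, rollout_fn, n_rollouts=8):
    """Keep only prompts whose rollout group shows reward variance.

    Prompts where every rollout earns the same reward (all correct or
    all wrong) produce zero group-relative advantage, so skipping them
    is one plausible way such filtering improves RFT stability and
    efficiency. Interfaces are illustrative assumptions:
      rollout_fn(prompt, n) -> list of n sampled responses
      reward_fn(prompt, response) -> scalar reward
    """
    kept = []
    for prompt in batch:
        responses = rollout_fn(prompt, n_rollouts)
        rewards = [reward_fn(prompt, r) for r in responses]
        if len(set(rewards)) > 1:  # non-degenerate reward group
            kept.append((prompt, responses, rewards))
    return kept
```

In this sketch, filtering happens before the policy-gradient update, so compute is spent only on instances that can actually move the policy.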
👥 Authors
Song Lai
Centre for Artificial Intelligence and Robotics, HKISI, CAS; City University of Hong Kong
Haohan Zhao
Centre for Artificial Intelligence and Robotics, HKISI, CAS; City University of Hong Kong
Rong Feng
Centre for Artificial Intelligence and Robotics, HKISI, CAS; City University of Hong Kong
Changyi Ma
Centre for Artificial Intelligence and Robotics, HKISI, CAS
Wenzhuo Liu
Institute of Automation, CAS; University of Chinese Academy of Sciences
Hongbo Zhao
Institute of Automation, CAS; University of Chinese Academy of Sciences
Xi Lin
City University of Hong Kong
Dong Yi
Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences (Computer Vision, Pattern Recognition)
Min Xie
City University of Hong Kong
Qingfu Zhang
Chair Professor, FIEEE, City University of Hong Kong (evolutionary computation, multiobjective optimization, computational intelligence)
Hongbin Liu
Centre for Artificial Intelligence and Robotics, HKISI, CAS; Institute of Automation, CAS; University of Chinese Academy of Sciences
Gaofeng Meng
Centre for Artificial Intelligence and Robotics, HKISI, CAS; Institute of Automation, CAS; University of Chinese Academy of Sciences
Fei Zhu
Centre for Artificial Intelligence and Robotics, HKISI, CAS; Institute of Automation, CAS; University of Chinese Academy of Sciences