Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

πŸ“… 2025-10-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Reinforcement Learning with Verifiable Rewards (RLVR) improves mathematical and multimodal reasoning in large language models but degrades fundamental capabilities such as perception and faithfulness. To address this, we propose RECAP, a dynamic objective-reweighting replay strategy that adaptively adjusts multi-objective training weights by online monitoring of short-term convergence and gradient-instability signals, requiring no auxiliary models or manual hyperparameter tuning. RECAP integrates KL-divergence regularization with cross-domain experience replay to enable end-to-end general knowledge retention within the RLVR framework. Experiments on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate that RECAP significantly mitigates degradation of foundational capabilities while enhancing reasoning performance, achieving superior task trade-offs. Its core innovation lies in a lightweight, adaptive, and parameter-free dynamic reweighting mechanism.

πŸ“ Abstract
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are computed on the current task and thus do not guarantee retention of broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments with Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
Problem

Research questions and friction points this paper is trying to address.

Mitigating capability regression in reasoning models during reinforcement learning
Preventing foundational skills forgetting through dynamic knowledge preservation
Balancing multiple training objectives without performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

RECAP replay strategy with dynamic objective reweighting
Online adaptation using convergence and instability signals
End-to-end method for existing RLVR pipelines
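The reweighting idea above can be sketched in code. This is a minimal illustration, not the paper's exact formulation: it assumes each objective reports a recent scalar reward in [0, 1], uses a short sliding window to estimate convergence (slope) and instability (variance), and softmax-normalizes a score that favors underperforming or volatile objectives over saturated ones. The class name, window size, and temperature are all assumptions for illustration.

```python
import math
from collections import deque


class DynamicReweighter:
    """Illustrative sketch of dynamic objective reweighting from
    short-horizon convergence and instability signals (hypothetical
    implementation, not the paper's released code)."""

    def __init__(self, objectives, window=8, temperature=1.0):
        # One short reward history per training objective.
        self.history = {name: deque(maxlen=window) for name in objectives}
        self.temperature = temperature

    def update(self, rewards):
        """rewards: {objective_name: latest mean reward in [0, 1]}."""
        for name, r in rewards.items():
            self.history[name].append(r)

    def weights(self):
        """Return softmax-normalized training weights per objective."""
        scores = {}
        for name, h in self.history.items():
            if len(h) < 2:
                scores[name] = 0.0  # not enough signal yet; neutral score
                continue
            # Convergence signal: average recent improvement. A flat
            # slope on a high reward suggests the objective is saturated.
            slope = (h[-1] - h[0]) / (len(h) - 1)
            # Instability signal: variance of rewards over the window.
            mean = sum(h) / len(h)
            var = sum((x - mean) ** 2 for x in h) / len(h)
            # Favor underperforming (low reward) and volatile (high
            # variance) objectives; penalize ones still improving.
            scores[name] = (1.0 - h[-1]) + var - max(slope, 0.0)
        # Softmax turns scores into a probability-like weight vector.
        exps = {k: math.exp(v / self.temperature) for k, v in scores.items()}
        total = sum(exps.values())
        return {k: v / total for k, v in exps.items()}
```

In use, a saturated objective (high, flat reward) receives less focus than a low, noisy one, matching the abstract's description of shifting post-training focus away from saturated objectives:

```python
rw = DynamicReweighter(["math", "perception"])
for m, p in [(0.9, 0.4), (0.92, 0.6), (0.94, 0.3), (0.95, 0.5)]:
    rw.update({"math": m, "perception": p})
w = rw.weights()  # perception gets the larger weight
```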
Hoang Phan
Meta Superintelligence Labs, New York University
Xianjun Yang
Meta Superintelligence Labs
Kevin Yao
Meta Superintelligence Labs
Jingyu Zhang
WNLO, Huazhong University of Science and Technology
Shengjie Bi
Meta Superintelligence Labs
Xiaocheng Tang
Meta Superintelligence Labs
Madian Khabsa
GenAI, Meta
Lijuan Liu
Meta Superintelligence Labs
Deren Lei
Meta GenAI
Natural Language Processing · Machine Learning