🤖 AI Summary
This work addresses the trade-off in continual learning for multimodal large language models between adapting to new tasks and preserving previously acquired knowledge, a challenge that is harder to manage when models are trained with reinforcement learning with verifiable rewards (RLVR), where no established mechanism guides continual adaptation. The study formalizes "portability," a sample-level measure of how reusable the previous policy's behavior is on a new task, and shows empirically that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. Building on this insight, the authors instantiate the measure as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which adjusts the strength of KL regularization per sample: a tight anchor preserves reusable reasoning on high-RP samples, and a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by 12.0% over the vanilla RLVR baseline.
📝 Abstract
Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
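The per-sample modulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual mapping from RP to the KL coefficient, so everything below is an assumption made for illustration: RP is taken to be a score in [0, 1], the coefficient is a linear interpolation between two hypothetical bounds `beta_min` and `beta_max`, and the objective is a simplified "advantage minus weighted KL" average rather than the authors' RLVR loss.

```python
from typing import List, Sequence


def kl_coefficients(
    rp: Sequence[float],
    beta_min: float = 0.01,  # hypothetical lower bound (relaxed anchor)
    beta_max: float = 0.1,   # hypothetical upper bound (tight anchor)
) -> List[float]:
    """Map per-sample RP scores to per-sample KL coefficients.

    Assumption: RP lies in [0, 1] and the mapping is linear. High RP gives
    a coefficient near beta_max, low RP gives one near beta_min.
    """
    betas = []
    for score in rp:
        clipped = min(max(score, 0.0), 1.0)  # guard against out-of-range scores
        betas.append(beta_min + (beta_max - beta_min) * clipped)
    return betas


def regularized_objective(
    advantages: Sequence[float],
    kl_to_previous: Sequence[float],
    rp: Sequence[float],
    beta_min: float = 0.01,
    beta_max: float = 0.1,
) -> float:
    """Average of (advantage - beta_i * KL_i) over the batch.

    kl_to_previous[i] stands for the KL divergence between the current
    policy and the previous-task policy on sample i (assumed precomputed).
    """
    betas = kl_coefficients(rp, beta_min, beta_max)
    terms = [a - b * kl for a, b, kl in zip(advantages, betas, kl_to_previous)]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    rp_scores = [0.0, 0.5, 1.0]
    print(kl_coefficients(rp_scores))
    print(regularized_objective([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], rp_scores))
```

Under these assumptions, a sample with RP near 1 pays a larger penalty for drifting from the previous policy, while a sample with RP near 0 is penalized only lightly and is freer to change. A nonlinear or thresholded mapping would fit the abstract's description equally well.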