Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the trade-off in continual learning for multimodal large language models (MLLMs) between adapting to new tasks and preserving previously acquired knowledge. The trade-off becomes harder to manage once reinforcement learning with verifiable rewards (RLVR) is added, because no established mechanism guides how strongly the model should stay anchored to its previous policy. The study formalizes portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and shows empirically that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. Building on this, the authors instantiate the measure as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which adjusts the strength of the KL regularization in RLVR per sample: a tight anchor preserves reusable reasoning on high-RP samples, and a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms the baselines and improves Last accuracy by +12.0% over the vanilla RLVR baseline.

📝 Abstract

Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
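
The per-sample modulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula that maps RP to a KL weight, so the linear rule below, the names kl_coefficient and regularized_objective, and the values of beta_min and beta_max are all assumptions made for illustration, not the authors' implementation. It only shows the stated direction: a higher RP yields a larger KL weight (tight anchor), and a lower RP yields a smaller one (relaxed anchor).

```python
from typing import Sequence


def kl_coefficient(rp: float, beta_min: float = 0.01, beta_max: float = 0.1) -> float:
    """Map a portability score in [0, 1] to a KL weight.

    Hypothetical linear rule: the source only states that high-RP samples
    get a tight anchor and low-RP samples a relaxed one.
    """
    rp = min(1.0, max(0.0, rp))  # clamp so out-of-range scores stay bounded
    return beta_min + (beta_max - beta_min) * rp


def regularized_objective(
    rewards: Sequence[float],
    kl_to_previous: Sequence[float],
    rp_scores: Sequence[float],
) -> float:
    """Mean over samples of reward minus the RP-weighted KL penalty.

    kl_to_previous[i] is assumed to be a precomputed KL estimate between
    the current policy and the previous policy on sample i.
    """
    if not (len(rewards) == len(kl_to_previous) == len(rp_scores)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("inputs must be non-empty")
    terms = [
        reward - kl_coefficient(rp) * kl
        for reward, kl, rp in zip(rewards, kl_to_previous, rp_scores)
    ]
    return sum(terms) / len(terms)
```

With these assumed defaults, two samples that have the same reward and the same KL estimate are penalized differently: the high-RP sample pays the larger penalty for drifting from the previous policy, while the low-RP sample is left freer to move.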

Problem

Research questions and friction points this paper is trying to address.

Continual Learning

Multimodal Large Language Models

Reinforcement Learning with Verifiable Rewards

Reasoning Portability

Knowledge Retention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning Portability

Continual Learning

Multimodal Large Language Models