MAny: Merge Anything for Multimodal Continual Instruction Tuning

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the dual-forgetting problem that multimodal large language models (MLLMs) face during continual instruction tuning: perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space. The authors propose MAny, a training-free knowledge-merging framework with two components: Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). CPM uses visual prototypes to guide the adaptive merging of cross-modal visual representations at inference time, while LPM merges task-specific low-rank weight matrices recursively via recursive least squares, using only CPU-based algebraic operations. On the UCIT benchmark, MAny leads state-of-the-art methods by up to 8.57% and 2.85% in final average accuracy on two different MLLMs, respectively.
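
To make the idea of prototype-guided merging more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names, the cosine-similarity softmax weighting, the `temperature` parameter, and the choice to merge per-task projector weight matrices are all assumptions made for illustration, since the summary does not specify these details.

```python
import numpy as np

def merge_weights(feature, prototypes, temperature=0.1):
    """Softmax weights from cosine similarity to per-task prototypes (assumed scheme)."""
    f = feature / np.linalg.norm(feature)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (p @ f) / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def merged_projection(feature, prototypes, projectors, temperature=0.1):
    """Apply a per-input weighted combination of task projectors (hypothetical)."""
    w = merge_weights(feature, prototypes, temperature)
    merged = np.tensordot(w, projectors, axes=1)   # (d_out, d_in)
    return merged @ feature

# Toy data: 3 tasks, 8-dim visual features, 4-dim projected features.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 8))        # one mean visual feature per task
projectors = rng.normal(size=(3, 4, 8))     # one projector matrix per task
feature = prototypes[1] + 0.01 * rng.normal(size=8)   # input close to task 1

w = merge_weights(feature, prototypes)
out = merged_projection(feature, prototypes, projectors)
```

In this toy setup the input lies close to the second task's prototype, so that task's projector receives the largest weight in the merged projection.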


📝 Abstract
Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work we expose a critical yet neglected dual-forgetting phenomenon: perception drift in the Cross-modal Projection Space and reasoning collapse in the Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
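
The abstract's appeal to recursive least squares rests on a standard property: updating a least-squares solution one task at a time gives exactly the same closed-form answer as solving over all tasks jointly, with no gradient steps. The sketch below demonstrates that property with generic block RLS on toy data. It is a textbook RLS illustration, not the paper's LPM procedure; the helper name `rls_update`, the ridge term `reg`, and the synthetic inputs and targets are assumptions introduced here.

```python
import numpy as np

def rls_update(W, P, X, Y):
    """One block recursive-least-squares step for a new batch (X: n x d, Y: n x k).

    P tracks the inverse of the regularized Gram matrix seen so far.
    """
    n = X.shape[0]
    # Woodbury identity: (P^-1 + X^T X)^-1 = P - P X^T (I + X P X^T)^-1 X P
    gain = P @ X.T @ np.linalg.inv(np.eye(n) + X @ P @ X.T)
    P_new = P - gain @ X @ P
    # Correct the previous solution using the residual on the new batch.
    W_new = W + P_new @ X.T @ (Y - X @ W)
    return W_new, P_new

rng = np.random.default_rng(0)
d, k, reg = 4, 2, 1.0                        # toy sizes and ridge strength (assumed)
tasks = [(rng.normal(size=(10, d)), rng.normal(size=(10, k))) for _ in range(3)]

# Sequential merge: visit each task once, never revisiting earlier data.
W_rls = np.zeros((d, k))
P = np.eye(d) / reg
for X, Y in tasks:
    W_rls, P = rls_update(W_rls, P, X, Y)

# Joint closed-form ridge solution over all tasks, for comparison.
X_all = np.vstack([X for X, _ in tasks])
Y_all = np.vstack([Y for _, Y in tasks])
W_batch = np.linalg.solve(reg * np.eye(d) + X_all.T @ X_all, X_all.T @ Y_all)
```

Here `W_rls` and `W_batch` agree up to floating-point error, which is the sense in which a recursive merge can be optimal without storing earlier tasks' data. All operations are small dense matrix products and inversions that run comfortably on a CPU.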
Problem

Research questions and friction points this paper is trying to address.

Multimodal Continual Instruction Tuning
Catastrophic Forgetting
Perception Drift
Reasoning Collapse
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Continual Learning
Catastrophic Forgetting
Cross-modal Projection Merging
Low-rank Parameter Merging
Training-free Knowledge Fusion
Zijian Gao
College of Computer Science and Technology, National University of Defense Technology
Wangwang Jia
College of Computer Science and Technology, National University of Defense Technology
Xingxing Zhang
Tsinghua University
Machine learning, Optimization
Pengfei Qian
College of Computer Science and Technology, National University of Defense Technology
Tao Sun
National University of Defense Technology
machine learning
Bo Ding
College of Computer Science and Technology, National University of Defense Technology
Yong Dou
College of Computer Science and Technology, National University of Defense Technology
Huaimin Wang
College of Computer Science and Technology, National University of Defense Technology
Kele Xu
College of Computer Science and Technology, National University of Defense Technology