HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address severe catastrophic forgetting and imbalanced memory/computation efficiency in continual instruction tuning of multimodal large language models (MLLMs), this paper proposes a hierarchical decoupled continual instruction learning framework. Methodologically, it dynamically activates a “task-specific expansion + task-agnostic fusion” mechanism based on inter-layer centered kernel alignment (CKA) similarity shifts; it also establishes a rigorous, information-leakage-free continual instruction evaluation benchmark. The approach achieves fine-grained decoupling across parameter updates, module expansion, and knowledge fusion—balancing adaptability and stability. Experiments demonstrate that our method improves average accuracy by 4.2% on the new benchmark, reduces GPU memory consumption by 37%, and accelerates training by 2.1×, significantly outperforming existing state-of-the-art approaches.

Technology Category

Application Category

📝 Abstract
Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Our code will be public available.
Problem

Research questions and friction points this paper is trying to address.

Enables continual instruction tuning for Multimodal Large Language Models
Addresses memory efficiency vs. performance trade-off in existing methods
Proposes a new benchmark to evaluate method performance rigorously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical decoupling for continual instruction tuning
Task-specific expansion and task-general fusion framework
New benchmark to evaluate performance more effectively
🔎 Similar Papers
No similar papers found.