LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual challenges of catastrophic forgetting and excessive parameter overhead in continual visual instruction tuning (CVIT) for multimodal large language models (MLLMs), this paper proposes LoRA in LoRA (LiLoRA), an efficient architecture expansion method. LiLoRA shares the LoRA matrix A across tasks, applies a further low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to keep shared representations consistent over time; task-specific module isolation keeps continual learning parameter-efficient and scalable. Experiments on multiple CVIT benchmarks show that the method reduces trainable parameters by 42% on average while consistently outperforming existing state-of-the-art methods, achieving superior stability, generalization, and parameter efficiency while simultaneously mitigating forgetting and minimizing computational overhead.

📝 Abstract
Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.
Problem

Research questions and friction points this paper is trying to address.

Mitigates catastrophic forgetting in continual visual instruction tuning
Reduces parameter overhead in architecture expansion methods
Improves scalability for multimodal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shares LoRA matrix A across tasks
Applies low-rank decomposition to matrix B
Uses cosine-regularized stability loss
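
The three innovations above can be sketched numerically. This is a minimal illustration of the nested-LoRA idea, not the paper's implementation: all names, shapes, ranks, and the zero-initialization choice are assumptions. A frozen weight W receives a per-task update ΔW_t = B_t · A, where A is shared across tasks and each task-specific B_t is itself factorized as U_t · V_t with a small inner rank, so per-task storage shrinks from d_out·r to (d_out + r)·r2.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, r2 = 64, 64, 8, 2  # assumed toy dimensions; inner rank r2 << r

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # LoRA matrix A, shared across all tasks

def new_task_factors():
    """Per-task trainable factors of B_t = U_t @ V_t (nested low-rank)."""
    U = rng.normal(size=(d_out, r2)) * 0.01
    V = np.zeros((r2, r))  # zero init so Delta_W starts at 0 (an assumption)
    return U, V

def forward(x, U, V):
    """y = (W + (U @ V) @ A) @ x for the currently active task."""
    delta_W = (U @ V) @ A
    return (W + delta_W) @ x

def stability_loss(A_new, A_old):
    """Cosine-style regularizer: penalize drift of the shared A (a sketch)."""
    num = float(np.sum(A_new * A_old))
    den = np.linalg.norm(A_new) * np.linalg.norm(A_old) + 1e-8
    return 1.0 - num / den

# Per-task parameter count for the B side:
plain_B = d_out * r        # plain LoRA: a full B per task -> 512 here
nested_B = (d_out + r) * r2  # nested factorization -> 144 here

# Usage: with V zero-initialized, the adapted forward matches the frozen model.
x = rng.normal(size=d_in)
U, V = new_task_factors()
y0 = forward(x, U, V)
```

The parameter arithmetic shows why the nesting helps: with these toy shapes the task-specific cost drops from 512 to 144 values per adapted layer, and the shared A is paid for only once across all tasks.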
Authors
Chang Che, Hefei University of Technology
Ziqi Wang, Hefei University of Technology
Pengwan Yang, University of Amsterdam (computer vision)
Qi Wang, Tsinghua University
Hui Ma, Hefei University of Technology
Zenglin Shi, Professor of Artificial Intelligence, Hefei University of Technology (Deep Learning, Computer Vision, Machine Learning, Multimedia)