🤖 AI Summary
This work addresses modality missingness, task interference, and catastrophic forgetting in large multimodal models trained on sequential data streams, a setting known as Continual Missing Modality Learning (CMML). The authors propose DeLo, a dual-decomposed low-rank expert architecture that dynamically composes LoRA update matrices from disentangled pools of modality-specific rank-one factors. The approach combines a task-partitioned framework, Cross-Modal Guided Routing, and a Task-Key Memory to support efficient and stable continual learning. Presented as the first framework to bring a dual-decomposed low-rank structure to continual learning under missing modalities, DeLo is designed to reduce interference between modalities and between tasks, support task-agnostic inference, and prevent forgetting. Experiments on established CMML benchmarks show that it significantly outperforms state-of-the-art methods, supporting the value of architecture-aware LoRA design for real-world multimodal scenarios.
📝 Abstract
Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenge of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning (CMML). Existing work on CMML has relied predominantly on prompt tuning, a technique that struggles with this task because its learnable prompts share a single embedding space and therefore interfere across tasks. A naive application of Low-Rank Adaptation (LoRA) with a modality-shared module also suffers from modality interference caused by competing gradients. To address both problems, we propose DeLo, the first framework to leverage a novel dual-decomposed low-rank expert architecture for CMML. Specifically, this architecture resolves modality interference through a decomposed LoRA expert, which dynamically composes the LoRA update matrix from rank-one factors drawn from disentangled modality-specific factor pools. Embedded within a task-partitioned framework that structurally prevents catastrophic forgetting, this expert system is supported by two key mechanisms: a Cross-Modal Guided Routing strategy to handle incomplete data and a Task-Key Memory for efficient, task-agnostic inference. Extensive experiments on established CMML benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches. This highlights the value of a principled, architecture-aware LoRA design for real-world multimodal challenges.
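The idea of composing a LoRA update from rank-one factors held in separate modality-specific pools can be illustrated with a small NumPy sketch. This is not the DeLo implementation: the abstract does not specify the routing rule, pool sizes, or how factors are parameterized, so the key/query dot-product selection, the fallback used for a missing modality, and every name below (`pools`, `compose_delta_w`, `delta_w`, the dimensions) are assumptions made purely for illustration.

```python
import numpy as np

def compose_delta_w(pool_u, pool_v, keys, query, rank):
    """Pick `rank` rank-one factors by key/query similarity and sum them."""
    scores = keys @ query                  # one score per factor in the pool
    top = np.argsort(scores)[-rank:]       # indices of the best-matching factors
    # Sum of outer products u_k v_k^T, written as one matrix product.
    return pool_u[top].T @ pool_v[top]     # shape (d_out, d_in), rank <= `rank`

rng = np.random.default_rng(0)
d_in, d_out, d_key, pool_size, rank = 16, 12, 8, 10, 3

# Hypothetical layout: one separate factor pool per modality.
pools = {
    m: {
        "u": rng.standard_normal((pool_size, d_out)),
        "v": rng.standard_normal((pool_size, d_in)),
        "keys": rng.standard_normal((pool_size, d_key)),
    }
    for m in ("image", "text")
}

def delta_w(queries):
    """`queries` maps modality -> query vector, or None if that modality is missing."""
    available = [q for q in queries.values() if q is not None]
    # Assumed stand-in for cross-modal guidance: a missing modality's pool
    # is routed with the mean query of the modalities that are present.
    fallback = np.mean(available, axis=0)
    total = np.zeros((d_out, d_in))
    for m, pool in pools.items():
        q = queries[m] if queries[m] is not None else fallback
        total += compose_delta_w(pool["u"], pool["v"], pool["keys"], q, rank)
    return total

full = delta_w({"image": rng.standard_normal(d_key),
                "text": rng.standard_normal(d_key)})
missing = delta_w({"image": None, "text": rng.standard_normal(d_key)})
print(full.shape, missing.shape)
```

Because each pool contributes at most `rank` rank-one terms, the composed update has rank at most `2 * rank` here, and the pools stay separate parameters, which is the property the abstract credits with avoiding competing gradients in a modality-shared module.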