Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address catastrophic forgetting in multimodal continual learning, this paper proposes a cross-modal adapter framework built on a mixture-of-experts (MoE) architecture. The method jointly optimizes a cross-modal representation alignment loss and a regularization term on relations among historical representations, so that the model absorbs new-task knowledge while retaining old-task knowledge. It integrates modality-specific adaptation, expert routing, and representation-consistency constraints into pretrained models, enabling dynamic incremental fusion of heterogeneous multimodal inputs (e.g., vision and language). On both class-incremental and domain-incremental multimodal continual learning benchmarks, the approach consistently outperforms existing state-of-the-art methods, with average accuracy gains of 3.2–5.7 percentage points and forgetting-rate reductions of 18.4–31.6%, demonstrating a better trade-off between knowledge transfer and learning stability.
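The cross-modality MoE adapter described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the class name, expert MLP shape, routing scheme, and residual fusion into the vision path are all illustrative assumptions about how a gated mixture of small experts might fuse frozen vision and language features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CrossModalityMoEAdapter:
    """Hypothetical sketch: a small mixture-of-experts adapter that fuses
    vision and language features via learned expert routing."""

    def __init__(self, dim, num_experts=4, hidden=16):
        # one tiny two-layer MLP per expert, operating on the fused features
        self.experts = [
            (rng.normal(0, 0.02, (2 * dim, hidden)),
             rng.normal(0, 0.02, (hidden, dim)))
            for _ in range(num_experts)
        ]
        # router maps fused features to a distribution over experts
        self.router = rng.normal(0, 0.02, (2 * dim, num_experts))

    def __call__(self, vision_feat, text_feat):
        fused = np.concatenate([vision_feat, text_feat], axis=-1)  # (B, 2d)
        gates = softmax(fused @ self.router)                       # (B, E)
        outs = np.stack(
            [np.maximum(fused @ w1, 0.0) @ w2 for w1, w2 in self.experts],
            axis=1,
        )                                                          # (B, E, d)
        # gate-weighted sum of expert outputs, added residually
        return vision_feat + np.einsum("be,bed->bd", gates, outs)

adapter = CrossModalityMoEAdapter(dim=8)
out = adapter(rng.normal(size=(2, 8)), rng.normal(size=(2, 8)))
print(out.shape)  # (2, 8)
```

In practice such an adapter would sit inside a frozen pretrained backbone and only the expert and router weights would be trained per task.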

📝 Abstract
Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and regularize relationships between learned representations to preserve knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.
Problem

Research questions and friction points this paper is trying to address.

Preventing catastrophic forgetting in multi-modal continual learning systems
Integrating new information from diverse sensory inputs across tasks
Maintaining knowledge from previous tasks while adapting to new ones
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modality adapter with mixture-of-experts structure
Representation alignment loss for robust multimodal learning
Regularizing relationships between representations to preserve knowledge
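The two training objectives listed above can be sketched as follows. This is an assumption-laden illustration, not the paper's loss functions: the alignment loss is rendered here as a cosine-distance pull between matched vision and language representations, and the knowledge-preservation term as a relational-distillation-style penalty on drift in pairwise similarity structure against representations saved from previous tasks.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def alignment_loss(vision, text):
    """Pull matched vision/text representations together (cosine distance)."""
    v, t = l2_normalize(vision), l2_normalize(text)
    return float(np.mean(1.0 - np.sum(v * t, axis=-1)))

def relation_regularizer(new_reps, old_reps):
    """Penalize drift in the pairwise similarity structure of the current
    batch relative to representations stored from the previous task."""
    def sim_matrix(x):
        x = l2_normalize(x)
        return x @ x.T
    return float(np.mean((sim_matrix(new_reps) - sim_matrix(old_reps)) ** 2))

rng = np.random.default_rng(1)
v, t = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
old = rng.normal(size=(4, 8))
# combined objective; the 0.5 weighting is arbitrary for illustration
total = alignment_loss(v, t) + 0.5 * relation_regularizer(v, old)
print(total >= 0.0)  # True
```

Constraining relations between representations, rather than the representations themselves, leaves the model freedom to shift features for new tasks while keeping the old tasks' similarity structure intact.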
Authors
Evelyn Chee (School of Computing, National University of Singapore)
W. Hsu (School of Computing, National University of Singapore)
Mong Li Lee (Professor of Computer Science, National University of Singapore; Database systems, Data management, Data analytics)