🤖 AI Summary
To address catastrophic forgetting in multimodal continual learning, this paper proposes a cross-modal adapter framework built on a Mixture-of-Experts (MoE) architecture. The method jointly optimizes a cross-modal representation alignment loss and a historical representation relation regularizer, absorbing new-task knowledge while retaining old-task knowledge. It integrates modality-specific adaptation, expert routing, and representation-consistency constraints into pretrained models, enabling dynamic incremental fusion of heterogeneous multimodal inputs (e.g., vision and language). On both class-incremental and domain-incremental multimodal continual learning benchmarks, the approach consistently outperforms existing state-of-the-art methods, achieving average accuracy gains of 3.2–5.7 percentage points and reducing forgetting rates by 18.4–31.6%, a better trade-off between knowledge transferability and learning stability.
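The joint objective described above can be sketched in generic form. This is a minimal NumPy illustration, not the paper's implementation: the alignment term is written as a symmetric InfoNCE-style contrastive loss, the relation regularizer as a match between pairwise cosine-similarity matrices of new and old representations, and the weights `lam_align`/`lam_rel` are hypothetical names.

```python
import numpy as np

def _normalize(x):
    # L2-normalize feature rows.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def alignment_loss(vis, txt, temperature=0.07):
    # Symmetric contrastive alignment between paired vision/language
    # features (a generic InfoNCE form; the paper's exact loss may differ).
    logits = _normalize(vis) @ _normalize(txt).T / temperature
    n = logits.shape[0]
    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return (ce(logits) + ce(logits.T)) / 2

def relation_regularizer(new_feats, old_feats):
    # One plausible "relation" regularizer: preserve the pairwise
    # cosine-similarity structure of the old model's representations.
    def sim(x):
        x = _normalize(x)
        return x @ x.T
    return ((sim(new_feats) - sim(old_feats)) ** 2).mean()

def total_loss(task_loss, vis, txt, new_feats, old_feats,
               lam_align=1.0, lam_rel=1.0):
    # Joint objective: new-task loss + alignment + relation retention.
    return (task_loss
            + lam_align * alignment_loss(vis, txt)
            + lam_rel * relation_regularizer(new_feats, old_feats))
```

When the current model's representations equal the frozen old model's, the relation term vanishes, so only genuine representation drift on old-task structure is penalized.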
📝 Abstract
Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and a relation regularizer that preserves relationships between learned representations to retain knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.
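The cross-modality adapter with a mixture-of-experts structure can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's design: the expert count, bottleneck width, concatenation-based fusion, and the residual connection onto the vision stream are all illustrative choices, and `MoECrossModalAdapter` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax for the routing gate.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoECrossModalAdapter:
    """Sketch of a mixture-of-experts adapter fusing vision and language
    features. Each expert is a small ReLU bottleneck MLP; a learned router
    gates the experts per input. All hyperparameters are assumptions."""
    def __init__(self, dim, n_experts=4, bottleneck=16):
        self.router = rng.standard_normal((2 * dim, n_experts)) * 0.02
        self.down = rng.standard_normal((n_experts, 2 * dim, bottleneck)) * 0.02
        self.up = rng.standard_normal((n_experts, bottleneck, dim)) * 0.02

    def __call__(self, vis, txt):
        x = np.concatenate([vis, txt], axis=-1)         # (B, 2*dim) fused input
        gate = softmax(x @ self.router)                 # (B, E) routing weights
        h = np.maximum(np.einsum('bd,edh->beh', x, self.down), 0.0)  # expert bottlenecks
        out = np.einsum('beh,ehd->bed', h, self.up)     # (B, E, dim) per-expert outputs
        fused = (gate[..., None] * out).sum(axis=1)     # gate-weighted mixture
        return vis + fused                              # residual on the vision stream
```

In a pre-trained-model-based setup like the one described, such an adapter would typically be inserted between frozen backbone layers, so only the router and expert weights are trained on each new task.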