🤖 AI Summary
To address catastrophic forgetting in multimodal continual learning, this paper proposes a cross-modal adapter framework built on a Mixture-of-Experts (MoE) architecture. The method jointly optimizes a cross-modal representation alignment loss and a historical representation relation regularizer, absorbing new-task knowledge while retaining old-task knowledge. It integrates modality-specific adaptation, expert routing, and representation-consistency constraints into pretrained models, enabling dynamic incremental fusion of heterogeneous multimodal inputs (e.g., vision and language). On both class-incremental and domain-incremental multimodal continual learning benchmarks, the approach consistently outperforms existing state-of-the-art methods, achieving average accuracy gains of 3.2–5.7 percentage points and reducing forgetting rates by 18.4–31.6%, a better trade-off between knowledge transferability and learning stability.
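The joint objective described above can be sketched in generic form. This is a minimal NumPy illustration, not the paper's implementation: the alignment term is written as a symmetric InfoNCE-style contrastive loss, the relation regularizer as a match between pairwise cosine-similarity matrices of new and old representations, and the weights `lam_align`/`lam_rel` are hypothetical names.

```python
import numpy as np

def _normalize(x):
    # L2-normalize feature rows.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def alignment_loss(vis, txt, temperature=0.07):
    # Symmetric contrastive alignment between paired vision/language
    # features (a generic InfoNCE form; the paper's exact loss may differ).
    logits = _normalize(vis) @ _normalize(txt).T / temperature
    n = logits.shape[0]
    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return (ce(logits) + ce(logits.T)) / 2

def relation_regularizer(new_feats, old_feats):
    # One plausible "relation" regularizer: preserve the pairwise
    # cosine-similarity structure of the old model's representations.
    def sim(x):
        x = _normalize(x)
        return x @ x.T
    return ((sim(new_feats) - sim(old_feats)) ** 2).mean()

def total_loss(task_loss, vis, txt, new_feats, old_feats,
               lam_align=1.0, lam_rel=1.0):
    # Joint objective: new-task loss + alignment + relation retention.
    return (task_loss
            + lam_align * alignment_loss(vis, txt)
            + lam_rel * relation_regularizer(new_feats, old_feats))
```

When the current model's representations equal the frozen old model's, the relation term vanishes, so only genuine representation drift on old-task structure is penalized.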
📝 Abstract
Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and a relation regularizer that preserves relationships between learned representations to retain knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.
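The cross-modality adapter with a mixture-of-experts structure can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's design: the expert count, bottleneck width, concatenation-based fusion, and the residual connection onto the vision stream are all illustrative choices, and `MoECrossModalAdapter` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax for the routing gate.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoECrossModalAdapter:
    """Sketch of a mixture-of-experts adapter fusing vision and language
    features. Each expert is a small ReLU bottleneck MLP; a learned router
    gates the experts per input. All hyperparameters are assumptions."""
    def __init__(self, dim, n_experts=4, bottleneck=16):
        self.router = rng.standard_normal((2 * dim, n_experts)) * 0.02
        self.down = rng.standard_normal((n_experts, 2 * dim, bottleneck)) * 0.02
        self.up = rng.standard_normal((n_experts, bottleneck, dim)) * 0.02

    def __call__(self, vis, txt):
        x = np.concatenate([vis, txt], axis=-1)         # (B, 2*dim) fused input
        gate = softmax(x @ self.router)                 # (B, E) routing weights
        h = np.maximum(np.einsum('bd,edh->beh', x, self.down), 0.0)  # expert bottlenecks
        out = np.einsum('beh,ehd->bed', h, self.up)     # (B, E, dim) per-expert outputs
        fused = (gate[..., None] * out).sum(axis=1)     # gate-weighted mixture
        return vis + fused                              # residual on the vision stream
```

In a pre-trained-model-based setup like the one described, such an adapter would typically be inserted between frozen backbone layers, so only the router and expert weights are trained on each new task.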