🤖 AI Summary
Unified multimodal generative models (UMGMs) suffer from both intra-modal and inter-modal catastrophic forgetting during continual learning, with the latter long overlooked. This paper is the first to systematically identify and quantify inter-modal forgetting. The authors propose Modality-Decoupled Experts (MoDE), a lightweight, scalable architecture that mitigates cross-modal interference through modality-specific experts and gradient-decoupled, isolated parameter updates. To preserve pretrained capabilities, MoDE incorporates noise-free knowledge distillation. Unlike prior approaches, it requires no input perturbation and scales efficiently to new modalities and tasks. Evaluated on multiple multimodal continual-learning benchmarks, MoDE significantly alleviates both intra- and inter-modal forgetting, consistently outperforming state-of-the-art methods across all metrics. The implementation is publicly available.
📝 Abstract
Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate this gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods, which remain modality-coupled and thus suffer from gradient conflict between modalities, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Code will be publicly available at https://github.com/Christina200/MoDE-official.git
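The core idea of decoupled, modality-isolated updates can be illustrated with a minimal sketch. This is not the paper's implementation; the expert parameterization (a per-feature gain), the hand-derived gradients, and all names here are illustrative assumptions. It shows only the routing principle: each modality owns its expert, and a gradient step on one modality's task leaves the other modality's parameters untouched.

```python
# Illustrative sketch of modality-decoupled expert routing (NOT the
# official MoDE implementation). Experts are simple per-feature gains,
# and the shared backbone is assumed frozen.

class ModalityDecoupledExperts:
    def __init__(self, dim, modalities=("text", "image")):
        # One lightweight expert per modality; here, a gain vector.
        self.dim = dim
        self.experts = {m: [1.0] * dim for m in modalities}

    def forward(self, h, modality):
        # Route the hidden state h through its modality's expert only.
        w = self.experts[modality]
        return [wi * hi for wi, hi in zip(w, h)]

    def update(self, h, grad_out, modality, lr=0.1):
        # Gradient-decoupled update: only the active modality's expert
        # receives gradients, so other modalities cannot interfere.
        w = self.experts[modality]
        for i in range(self.dim):
            w[i] -= lr * grad_out[i] * h[i]

mode = ModalityDecoupledExperts(dim=2)
image_before = list(mode.experts["image"])
mode.update(h=[1.0, 2.0], grad_out=[0.5, -0.5], modality="text")
assert mode.experts["image"] == image_before   # image expert untouched
assert mode.experts["text"] != [1.0, 1.0]      # text expert adapted
```

Because the update for a text task never writes to the image expert, the cross-modal gradient conflict the abstract describes cannot arise at the expert level; only shared (frozen or distillation-protected) parameters see both modalities.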