FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

📅 2026-05-10
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the disconnect between multitask pretraining and subsequent continual learning in multimodal models by proposing a scalable sparse Mixture-of-Experts (MoE) framework. The approach introduces a modality-aware router to handle heterogeneous inputs and compresses expert knowledge into a low-rank memory subspace. By expanding only lightweight routers while keeping the backbone capacity fixed, the framework enables efficient task-incremental continual learning. It effectively mitigates catastrophic forgetting and substantially improves parameter efficiency. Extensive experiments on multiple medical multimodal benchmarks demonstrate the model's ability to continuously adapt to new tasks while preserving pretrained performance.
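The summary does not say how expert knowledge is compressed into a low-rank subspace. As a rough illustration of why a low-rank representation saves parameters, the sketch below uses a truncated SVD, which is an assumption chosen for simplicity and not necessarily the paper's method. The names `compress_low_rank`, `delta`, `d`, and `rank` are hypothetical.

```python
import numpy as np

def compress_low_rank(weight, rank):
    """Truncated SVD: return factors (a, b) so that a @ b is a rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # scale the leading left singular vectors
    b = vt[:rank, :]             # leading right singular vectors
    return a, b

rng = np.random.default_rng(0)
d, rank = 64, 4

# Hypothetical expert weight update: low-rank signal plus small noise (assumption).
delta = rng.normal(size=(d, rank)) @ rng.normal(size=(rank, d))
delta += 0.01 * rng.normal(size=(d, d))

a, b = compress_low_rank(delta, rank)
rel_err = np.linalg.norm(delta - a @ b) / np.linalg.norm(delta)

full_params = delta.size            # d * d stored values
low_rank_params = a.size + b.size   # 2 * d * rank stored values
print(f"relative error {rel_err:.4f}, {full_params} vs {low_rank_params} parameters")
```

When the update really is close to low-rank, storing the two factors costs `2 * d * rank` values instead of `d * d`, which is the kind of saving a fixed-capacity backbone would rely on.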
πŸ“ Abstract
Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, in which tasks are co-available at design time and related tasks can borrow representational strength from one another, and (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency available among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework supports training on multimodal tasks with diverse modality configurations by using modality-specific routers that process the tokens of each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks, where it demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.
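To make the idea of modality-specific routing concrete, here is a minimal NumPy sketch in which each modality has its own router but all modalities share one pool of experts. The top-k softmax gating, the linear experts, and every name (`top_k_route`, `moe_layer`, the `"image"`, `"text"`, and `"ecg"` keys) are illustrative assumptions; the abstract does not specify the actual architecture.

```python
import numpy as np

def top_k_route(tokens, router_weights, k=2):
    """Pick the k highest-scoring experts per token and softmax over just those k."""
    logits = tokens @ router_weights                      # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=1)[:, :k]          # indices of the top-k experts
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)  # stability
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=1, keepdims=True)             # renormalise over selected experts
    return top_idx, gates

def moe_layer(tokens_by_modality, routers, experts, k=2):
    """Route each modality's tokens with its own router into a shared expert pool."""
    outputs = {}
    for modality, tokens in tokens_by_modality.items():
        idx, gates = top_k_route(tokens, routers[modality], k)
        out = np.zeros_like(tokens)
        for slot in range(k):
            for e, expert in enumerate(experts):
                mask = idx[:, slot] == e                  # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * (tokens[mask] @ expert)
        outputs[modality] = out
    return outputs

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # shared, fixed-size pool
routers = {m: rng.normal(size=(d, n_experts)) for m in ("image", "text")}
batch = {"image": rng.normal(size=(5, d)), "text": rng.normal(size=(3, d))}
out = moe_layer(batch, routers, experts)

# Hypothetical continual step: a new modality adds only a router; experts are untouched.
routers["ecg"] = rng.normal(size=(d, n_experts))
idx, gates = top_k_route(batch["text"], routers["text"], k=2)
```

In this toy setup, supporting a new modality costs one `d x n_experts` router matrix, while the expert pool keeps its original size.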
Problem

Research questions and friction points this paper is trying to address.

continual learning
multimodal learning
multi-task learning
catastrophic forgetting
modality combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Continual Learning
Multimodal Learning
Modality-Specific Routing
Low-Rank Memory
Xing Han
Department of Computer Science, Johns Hopkins University
Shravan Chaudhari
CS PhD Student at Johns Hopkins University
Domain Adaptation, OOD Detection, Computer Vision, Graph Neural Networks
Tanvi Ranade
Department of Computer Science, Johns Hopkins University
Rama Chellappa
Bloomberg Distinguished Professor, Johns Hopkins University
Image Analysis, artificial intelligence, biometrics, Computer Vision, Biomedical Data Science
Suchi Saria
Department of Computer Science, Johns Hopkins University, Bayesian Health