How to Teach Large Multimodal Models New Skills

📅 2025-10-09
🤖 AI Summary
This work addresses catastrophic forgetting in large multimodal models (LMMs) during continual acquisition of new skills. To mitigate capability drift while preserving prior knowledge, we propose an efficient selective fine-tuning method. By analyzing the correlation between output token distribution shifts and forgetting during fine-tuning, we identify that updating only the self-attention projection layers or the MLP gating layers suffices to significantly reduce forgetting. We further introduce a counting-bias probe to quantify forgetting and integrate it with a hierarchical parameter freezing strategy for precise knowledge retention. Evaluated across five newly introduced skills, our method achieves substantial performance gains while maintaining near-original accuracy on eight retained tasks (average degradation under 1.2%). The approach demonstrates robustness and generalizability across multiple mainstream LMM families (e.g., LLaVA, Qwen-VL, InternVL), offering a scalable, low-overhead solution for sustainable capability expansion in multimodal foundation models.

📝 Abstract
How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up projections while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
Problem

Research questions and friction points this paper is trying to address.

Teaching new skills to multimodal models without forgetting prior abilities
Studying sequential fine-tuning effects on model performance across benchmarks
Developing tuning methods to maintain skills while learning new tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning only self-attention projection layers
Updating MLP Gate&Up while freezing Down projection
Limiting output distribution drift to prevent forgetting
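The two recipes above amount to choosing which parameter groups to update by name. A minimal, framework-agnostic sketch of that selection is below; the layer-name patterns (`q_proj`, `gate_proj`, etc.) mimic common LMM decoder naming and are assumptions, not the paper's exact implementation.

```python
# Hedged sketch: select trainable parameters for the two tuning recipes.
# Name patterns below are assumed from common LMM decoder conventions
# (e.g. LLaMA-style blocks); the paper's exact matching may differ.

SA_PROJ = ("q_proj", "k_proj", "v_proj", "o_proj")  # self-attention projections
MLP_GATE_UP = ("gate_proj", "up_proj")              # MLP Gate&Up; Down stays frozen


def trainable_params(param_names, recipe):
    """Return the parameter names to update under a given recipe."""
    if recipe == "self_attn_proj":
        patterns = SA_PROJ
    elif recipe == "mlp_gate_up":
        patterns = MLP_GATE_UP
    else:
        raise ValueError(f"unknown recipe: {recipe}")
    return [n for n in param_names if any(p in n for p in patterns)]


# Hypothetical parameter names for one decoder block:
names = [
    "layers.0.self_attn.q_proj.weight",
    "layers.0.self_attn.k_proj.weight",
    "layers.0.self_attn.v_proj.weight",
    "layers.0.self_attn.o_proj.weight",
    "layers.0.mlp.gate_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.down_proj.weight",
]

print(trainable_params(names, "self_attn_proj"))  # the four attention projections
print(trainable_params(names, "mlp_gate_up"))     # gate and up only; down_proj frozen
```

In a training framework, the same selection would typically be applied by setting `requires_grad = False` on every parameter whose name does not match, so the optimizer only updates the chosen subset.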
👥 Authors
Zhen Zhu, University of Illinois at Urbana-Champaign (Computer Vision, Deep Learning)
Yiming Gong, University of Illinois Urbana-Champaign
Yao Xiao, University of Illinois Urbana-Champaign
Yaoyao Liu, University of Illinois Urbana-Champaign
Derek Hoiem, Professor of Computer Science, University of Illinois