Fine-tuning MLLMs Without Forgetting Is Easier Than You Think

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the catastrophic forgetting of out-of-distribution data in multimodal large language models during fine-tuning, which stems primarily from over-adaptation to the target task. To mitigate this issue, the authors propose a simple yet effective strategy that combines parameter constraints, low learning rate fine-tuning, and mixed training across tasks and datasets. Through a systematic 2×2 evaluation framework assessing model performance both in- and out-of-distribution, experiments demonstrate that the proposed approach significantly alleviates forgetting on visual question answering tasks. Notably, it outperforms existing methods that rely on complex auxiliary mechanisms in continual learning scenarios, thereby challenging the prevailing assumption that sophisticated architectures are necessary to prevent catastrophic forgetting.
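The 2×2 evaluation described in the summary can be sketched as a small harness that scores a model over every combination of in-distribution (ID) and out-of-distribution (OOD) image and text inputs. This is a minimal illustration, not the paper's code: `model_fn` and the `splits` layout are assumed names.

```python
def evaluate_2x2(model_fn, splits):
    """Score a model over the 2x2 grid of ID/OOD image x ID/OOD text.

    `splits` maps an (image_dist, text_dist) pair, e.g. ("ID", "OOD"),
    to a list of (image, question, answer) triples. `model_fn(image,
    question)` returns a predicted answer. Both are placeholder names.
    Returns per-cell accuracy, so forgetting can be read off by
    comparing the OOD cells before and after fine-tuning.
    """
    results = {}
    for (img_dist, txt_dist), samples in splits.items():
        correct = sum(model_fn(img, q) == a for img, q, a in samples)
        results[(img_dist, txt_dist)] = correct / len(samples)
    return results
```

Comparing the (ID, OOD-text) cell against the (OOD-image, ID) cell is what surfaces the paper's distinction between over-adaptation and task-specific overfitting.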

📝 Abstract
We demonstrate that simple adjustments to the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2×2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address it by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods that rely on complex auxiliary mechanisms. Overall, our findings challenge prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.
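The data-hybrid training strategy from the abstract amounts to drawing each fine-tuning batch from both the target task and a pool of other datasets/tasks. The sketch below is an illustrative batch sampler under that assumption; the function name, `mix_ratio` parameter, and data layout are not from the paper.

```python
import random

def hybrid_batches(target_data, auxiliary_data, mix_ratio=0.5,
                   batch_size=8, seed=0):
    """Yield batches mixing target-task samples with auxiliary samples
    drawn from other datasets/tasks (illustrative, not the paper's API).

    mix_ratio is the fraction of each batch taken from the target task;
    the rest comes from the auxiliary pool, so the model keeps seeing
    off-task data during fine-tuning.
    """
    rng = random.Random(seed)
    n_target = max(1, int(batch_size * mix_ratio))
    n_aux = batch_size - n_target
    while True:
        batch = (rng.sample(target_data, n_target)
                 + rng.sample(auxiliary_data, n_aux))
        rng.shuffle(batch)  # avoid a fixed target/auxiliary ordering
        yield batch
```

With `mix_ratio=1.0` this degenerates to standard single-task fine-tuning, which is the regime where the abstract reports task-specific overfitting.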
Problem

Research questions and friction points this paper is trying to address.

catastrophic forgetting
multimodal large language models
fine-tuning
out-of-distribution
continual learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

catastrophic forgetting
multimodal large language models
data-hybrid training
continual learning
task-specific overfitting