🤖 AI Summary
Existing multimodal large language models (MLLMs) are built through resource-intensive full-parameter fine-tuning and suffer from inflexibility and catastrophic forgetting. This paper proposes MMER, a training-free modality expansion framework that reuses existing MLLMs' pretrained multimodal encoders and merges their LLM parameters into a single backbone. By comparing the original and merged LLM parameters, MMER derives binary masks that approximately decouple modality-specific parameters, so each modality's inputs can be processed with its own parameters, reducing parameter conflicts without any additional training. The approach integrates multiple modalities while retaining roughly 99% of the original MLLMs' performance, and the same mechanism markedly mitigates catastrophic forgetting when MLLMs are fine-tuned on new tasks. Extensive evaluations across multimodal benchmarks show consistent improvements over baselines, making MMER a practical route to expanding LLMs' multimodal capabilities without costly fine-tuning from scratch.
📝 Abstract
Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, demonstrating that MMER effectively expands LLMs' multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.
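A minimal sketch of the merge-then-mask idea described in the abstract, assuming a task-arithmetic-style averaging merge of LLM weight deltas and a sign-agreement heuristic for the binary masks; the abstract does not specify MMER's actual merging or masking rules, and the function names below (`merge_and_mask`, `route`) are illustrative, not the paper's API.

```python
# Hedged sketch only: the merging rule (delta averaging) and mask criterion
# (sign agreement with the merged shift) are assumptions, not MMER's exact method.
import torch

def merge_and_mask(base_llm: dict, mllm_llms: dict) -> tuple[dict, dict]:
    """base_llm: name -> tensor of the original (text-only) LLM weights.
    mllm_llms: modality -> {name -> tensor} of each MLLM's fine-tuned LLM weights.
    Returns merged LLM weights and one binary mask per modality."""
    # Task vectors: how each MLLM's fine-tuning shifted the shared LLM backbone.
    deltas = {m: {k: w[k] - base_llm[k] for k in base_llm}
              for m, w in mllm_llms.items()}

    # Simple averaging merge of the deltas (one of many possible merging rules).
    merged = {k: base_llm[k] + sum(d[k] for d in deltas.values()) / len(deltas)
              for k in base_llm}

    # Binary masks: keep entries where the merged shift agrees in sign with the
    # modality-specific delta, so masked parameters approximate that MLLM.
    masks = {m: {k: (merged[k] - base_llm[k]).sign() == d[k].sign()
                 for k in base_llm}
             for m, d in deltas.items()}
    return merged, masks

def route(merged: dict, base_llm: dict, mask: dict) -> dict:
    """Approximately recover modality-specific weights: use merged values where
    the mask is set, fall back to the original LLM elsewhere."""
    return {k: torch.where(mask[k], merged[k], base_llm[k]) for k in merged}
```

In this reading, inputs of a given modality would be run through the weights recovered by `route(...)`, so a single merged parameter set can approximate each original MLLM without any further training.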