Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of multimodal models on purely linguistic tasks—such as grammatical understanding—this paper proposes a parameter-fusion method that jointly supports multimodal and unimodal language capabilities in a low-resource setting grounded in developmentally plausible child language data. Specifically, the authors use weighted linear interpolation to fuse the parameters of a pretrained unimodal language model with those of a multimodal model trained on a small, developmentally plausible multimodal dataset. The approach requires no architectural modifications or task-specific adaptations. It mitigates modality interference and achieves a more balanced competence: it outperforms multimodal baselines on language-only benchmarks (e.g., BabyLM's grammar-focused evaluations) while largely preserving vision–language performance. The work points toward capability-balanced multimodal models informed by developmental science.

📝 Abstract
State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in *language-only* tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with *model merging*, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
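The merging step described above reduces to a per-parameter weighted average. A minimal sketch of what that might look like over two model state dicts is below; the parameter names, the handling of vision-only parameters, and the interpolation weight `alpha` are illustrative assumptions, not the paper's exact configuration:

```python
# Hypothetical sketch of model merging via weighted linear interpolation:
# theta_merged = alpha * theta_multimodal + (1 - alpha) * theta_text_only
# for every parameter shared by both models. Plain Python lists stand in
# for parameter tensors to keep the example self-contained.

def merge_state_dicts(multimodal, text_only, alpha=0.5):
    """Linearly interpolate parameters shared by both models.

    Parameters unique to the multimodal model (e.g. a vision encoder,
    assumed here) are copied through unchanged.
    """
    merged = {}
    for name, w_mm in multimodal.items():
        if name in text_only:
            w_txt = text_only[name]
            merged[name] = [alpha * a + (1 - alpha) * b
                            for a, b in zip(w_mm, w_txt)]
        else:
            merged[name] = list(w_mm)
    return merged

# Toy example with flat parameter vectors:
mm = {"embed": [1.0, 2.0], "vision_proj": [0.5]}   # multimodal model
txt = {"embed": [3.0, 4.0]}                        # language-only model
print(merge_state_dicts(mm, txt, alpha=0.5))
# → {'embed': [2.0, 3.0], 'vision_proj': [0.5]}
```

Sweeping `alpha` trades off multimodal and language-only ability: `alpha=1.0` recovers the multimodal model, `alpha=0.0` replaces all shared parameters with the text-only model's weights.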
Problem

Research questions and friction points this paper is trying to address.

Multimodal models underperform in language-only grammar tasks
Whether model merging can preserve language abilities in multimodal systems
Addressing performance gap between text-only and multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model merging fuses multimodal and language-only parameters
Weighted linear interpolation maintains language-only abilities
Developmentally plausible datasets train low-resource multimodal models