DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how to give an existing multimodal model multilingual capability without constructing costly multilingual multimodal data or repeating end-to-end training. The authors propose DiM3, a training-free merging method that uses a direction- and magnitude-aware mechanism to selectively compose the multilingual and multimodal residual updates in the shared language model backbone, one parameter dimension at a time. The original vision encoder and multimodal projector are left unchanged, and the method is evaluated on both LLaVA- and Qwen-based backbones. On text-only and vision-language benchmarks covering 57 languages, DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and largely retains general multimodal ability. It also yields additional gains when applied to already trained multilingual multimodal models. Interpretability analysis indicates that the merge mainly reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment while preserving higher-layer task-sensitive structure.

📝 Abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is available at https://github.com/wzj1718/DiM3.
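
To make the idea of composing two residual updates "at each parameter dimension" concrete, the following is a minimal sketch of one possible direction- and magnitude-aware rule. The abstract does not state the exact selection rule, so everything here is an assumption for illustration: the function name `merge_updates`, the choice to sum the updates where their signs agree, and the choice to keep the larger-magnitude update where they conflict. It should not be read as DiM3's actual procedure; the linked repository is the reference for that.

```python
import numpy as np

def merge_updates(base, multilingual, multimodal):
    """Illustrative element-wise merge of two fine-tuned weight tensors.

    Hypothetical rule (NOT taken from the paper):
      - same direction (signs agree, or one update is zero): add both updates
      - opposite direction: keep only the update with the larger magnitude
    """
    # Residual updates relative to the shared base language model.
    tau_ml = multilingual - base   # multilingual update
    tau_mm = multimodal - base     # multimodal update

    # Direction check: a non-negative sign product means no conflict.
    same_direction = np.sign(tau_ml) * np.sign(tau_mm) >= 0

    # Magnitude check, used only where the directions conflict.
    ml_dominates = np.abs(tau_ml) > np.abs(tau_mm)

    composed = np.where(
        same_direction,
        tau_ml + tau_mm,
        np.where(ml_dominates, tau_ml, tau_mm),
    )
    return base + composed


if __name__ == "__main__":
    base = np.zeros(4)
    multilingual = np.array([0.2, -0.5, 0.1, 0.0])
    multimodal = np.array([0.3, 0.1, -0.4, 0.2])
    print(merge_updates(base, multilingual, multimodal))
```

In a real model this rule would be applied tensor by tensor to the language model backbone only, leaving the vision encoder and multimodal projector weights untouched, as the abstract describes.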

Problem

Research questions and friction points this paper is trying to address.

multilingual

multimodal

model merging

large language models

cross-lingual alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter merging

multilingual multimodal learning

training-free adaptation