🤖 AI Summary
Current multimodal language models predominantly rely on a single generic visual encoder and struggle to balance domain-specific expertise with cross-domain generalization. To address this, we propose a multi-visual-encoder mixture architecture with a soft routing mechanism, requiring neither fine-tuning nor image patching, that dynamically dispatches input images to the most suitable pre-trained specialized encoder (e.g., Unichat, InternViT, Texify) via a gated Mixture-of-Experts (MoE) scheme. The method combines heterogeneous encoders, lightweight adapter interfaces, and a zero-shot domain selection strategy, unifying expertise and generalization without inflating the parameter count. Evaluated on ChartQA, MMBench, and MMMU, it attains state-of-the-art or near-state-of-the-art performance while supporting end-to-end high-resolution inference, improving both efficiency and accuracy.
📝 Abstract
Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through an adapter. While existing approaches commonly rely on a single pre-trained vision encoder, a wide variety of specialized encoders exists that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach that leverages multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.
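To illustrate the routing idea described above, here is a minimal sketch of a mixture-of-vision-encoders dispatcher. The paper does not specify its router implementation, so everything here is an assumption for illustration: the class name, the use of cosine similarity against per-domain prototype vectors, and the stand-in encoder callables are all hypothetical, not MOVE's actual API.

```python
import numpy as np

class MixtureOfVisionEncoders:
    """Hypothetical sketch: route each input to the specialized encoder
    whose domain prototype best matches a cheap global image embedding
    (argmax over cosine similarity), then run only that encoder."""

    def __init__(self, encoders, prototypes):
        # encoders:   dict mapping domain name -> callable(image) -> features
        # prototypes: dict mapping domain name -> prototype vector (assumed
        #             to come from some zero-shot domain classifier)
        self.encoders = encoders
        self.names = list(encoders)
        P = np.stack([prototypes[n] for n in self.names]).astype(float)
        # Normalize prototype rows so the dot product is cosine similarity.
        self.P = P / np.linalg.norm(P, axis=1, keepdims=True)

    def route(self, probe):
        # probe: a global embedding of the input image (assumed precomputed)
        v = probe / np.linalg.norm(probe)
        scores = self.P @ v
        return self.names[int(np.argmax(scores))]

    def __call__(self, image, probe):
        # Dispatch to exactly one expert encoder; no image slicing needed.
        name = self.route(probe)
        return name, self.encoders[name](image)
```

A usage sketch with toy 2-D prototypes: registering `{"chart": ..., "ocr": ...}` encoders and routing a probe embedding close to the chart prototype would select the chart encoder. The hard argmax shown here is the simplest gating choice; a soft (weighted) mixture over encoder outputs is an equally valid variant.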