🤖 AI Summary
Current multimodal language models predominantly rely on a single generic visual encoder and struggle to balance domain-specific expertise with cross-domain generalization. To address this, we propose a multi-visual-encoder mixture architecture with a soft routing mechanism, requiring neither fine-tuning nor image patching, that dynamically dispatches input images to the most suitable pre-trained specialized encoder (e.g., Unichat, InternViT, Texify) via a gated Mixture-of-Experts (MoE) scheme. The method combines heterogeneous encoders, lightweight adapter interfaces, and a zero-shot domain selection strategy, unifying expertise and generalization without inflating the parameter count. Evaluated on ChartQA, MMBench, and MMMU, it attains state-of-the-art or near-state-of-the-art performance while supporting end-to-end high-resolution inference, improving both efficiency and accuracy.
📝 Abstract
Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through an adapter. While existing approaches commonly rely on a single pre-trained vision encoder, a wide variety of specialized encoders exists that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach that leverages multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.
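To illustrate the routing idea described above, here is a minimal sketch of a mixture-of-vision-encoders dispatcher. The paper does not specify its router implementation, so everything here is an assumption for illustration: the class name, the use of cosine similarity against per-domain prototype vectors, and the stand-in encoder callables are all hypothetical, not MOVE's actual API.

```python
import numpy as np

class MixtureOfVisionEncoders:
    """Hypothetical sketch: route each input to the specialized encoder
    whose domain prototype best matches a cheap global image embedding
    (argmax over cosine similarity), then run only that encoder."""

    def __init__(self, encoders, prototypes):
        # encoders:   dict mapping domain name -> callable(image) -> features
        # prototypes: dict mapping domain name -> prototype vector (assumed
        #             to come from some zero-shot domain classifier)
        self.encoders = encoders
        self.names = list(encoders)
        P = np.stack([prototypes[n] for n in self.names]).astype(float)
        # Normalize prototype rows so the dot product is cosine similarity.
        self.P = P / np.linalg.norm(P, axis=1, keepdims=True)

    def route(self, probe):
        # probe: a global embedding of the input image (assumed precomputed)
        v = probe / np.linalg.norm(probe)
        scores = self.P @ v
        return self.names[int(np.argmax(scores))]

    def __call__(self, image, probe):
        # Dispatch to exactly one expert encoder; no image slicing needed.
        name = self.route(probe)
        return name, self.encoders[name](image)
```

A usage sketch with toy 2-D prototypes: registering `{"chart": ..., "ocr": ...}` encoders and routing a probe embedding close to the chart prototype would select the chart encoder. The hard argmax shown here is the simplest gating choice; a soft (weighted) mixture over encoder outputs is an equally valid variant.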