🤖 AI Summary
To address inefficient multimodal information fusion across multiple vision encoders, which limits the visual understanding capability of multimodal large language models (MLLMs), this paper proposes LEO, a dual-branch vision encoder framework. LEO introduces adaptive tiling and post-adaptation token interleaving fusion, enabling dynamic integration of vision tokens from heterogeneous ViT encoders. Its core contributions are: (1) the first sequence-level vision token interleaving fusion paradigm; and (2) cross-domain transferability: adapted to autonomous driving scenarios without changes to the model architecture or training recipe, LEO achieves competitive performance. Evaluated on 13 mainstream vision-language benchmarks, LEO consistently surpasses existing open-source and hybrid MLLMs, delivering notable gains in fine-grained visual understanding and cross-modal alignment.
📝 Abstract
Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual token sequences. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies for visual tokens in hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.
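The fusion strategy described above can be illustrated with a minimal sketch. Assuming each encoder emits an adapted token matrix per tile, the sequence-level interleaving simply alternates the two encoders' token blocks tile by tile; all names and shapes here are our own illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def interleave_tile_tokens(tiles_a, tiles_b):
    """Hypothetical sketch of sequence-level token interleaving:
    for each image tile, append the (post-adaptation) tokens from
    encoder A, then those from encoder B, and concatenate across tiles.
    tiles_a, tiles_b: lists of (n_tokens, d) arrays, one entry per tile."""
    fused = []
    for tok_a, tok_b in zip(tiles_a, tiles_b):
        fused.append(tok_a)  # tokens from vision encoder A for this tile
        fused.append(tok_b)  # tokens from vision encoder B for this tile
    return np.concatenate(fused, axis=0)  # single visual token sequence

# Toy example: 2 tiles; encoder A yields 3 tokens/tile, encoder B yields 2.
d = 4
tiles_a = [np.zeros((3, d)) for _ in range(2)]
tiles_b = [np.ones((2, d)) for _ in range(2)]
seq = interleave_tile_tokens(tiles_a, tiles_b)
print(seq.shape)  # (10, 4): (3 + 2) tokens per tile, 2 tiles
```

The resulting flat sequence is what would be handed to the language model; keeping both encoders' tokens adjacent per tile preserves their spatial correspondence without any architectural change.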