Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 11
Influential: 0
🤖 AI Summary
To address optimization instability and catastrophic forgetting in the pre-training of monolithic multimodal large language models (MLLMs), this paper proposes Mono-InternVL, which embeds a set of visual experts into a pre-trained LLM via a multimodal Mixture-of-Experts (MoE) structure and trains them with an Endogenous Visual Pre-training (EViP) strategy. EViP freezes the language model's parameters and trains the visual experts progressively, moving from noisy data to high-quality data, so that visual knowledge is absorbed without the parameter interference that plagues conventional vision–language alignment. Experiments on 16 benchmarks show that Mono-InternVL achieves state-of-the-art performance among monolithic MLLMs on 13 of them (e.g., +80 points over Emu3 on OCRBench) and remains comparable to the modular baseline InternVL-1.5 while reducing first-token latency by up to 67%.

📝 Abstract
In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. In particular, we identify that existing pre-training strategies for monolithic MLLMs often suffer from unstable optimization or catastrophic forgetting. To address this issue, our core idea is to embed a new visual parameter space into a pre-trained LLM, thereby stably learning visual knowledge from noisy data while freezing the LLM. Based on this principle, we present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. Moreover, we propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data. To validate our approach, we conduct extensive experiments on 16 benchmarks. Experimental results confirm the superior performance of Mono-InternVL over existing monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3 on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5, Mono-InternVL still retains comparable multimodal performance while reducing first-token latency by up to 67%. Code and model are released at https://github.com/OpenGVLab/Mono-InternVL.
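The architectural idea in the abstract, embedding visual experts into a frozen LLM via a multimodal mixture-of-experts structure, amounts to hard routing by modality: visual tokens pass through newly added, trainable visual experts, while text tokens keep using the pre-trained FFN. A minimal sketch of that routing, with toy dimensions and illustrative function names (not the paper's actual API):

```python
# Toy sketch of modality-based hard routing in a multimodal MoE FFN layer.
# Visual tokens go to a trainable visual expert; text tokens go to the
# frozen, pre-trained FFN. All names and values are illustrative only.

def frozen_text_ffn(token):
    # Stands in for the pre-trained LLM FFN (kept frozen during EViP).
    return [v * 1.0 for v in token]

def visual_expert_ffn(token):
    # Stands in for the newly embedded visual parameter space, the only
    # part updated during endogenous visual pre-training.
    return [v * 2.0 for v in token]

def multimodal_moe_layer(tokens, modalities):
    """Route each token to an expert based on its modality tag."""
    out = []
    for tok, mod in zip(tokens, modalities):
        expert = visual_expert_ffn if mod == "vis" else frozen_text_ffn
        out.append(expert(tok))
    return out

tokens = [[1.0, 2.0], [3.0, 4.0]]
modalities = ["vis", "txt"]
print(multimodal_moe_layer(tokens, modalities))  # [[2.0, 4.0], [3.0, 4.0]]
```

Because routing is decided by the token's modality rather than a learned gate, text tokens see exactly the computation the original LLM would perform, which is what lets the visual experts be trained without disturbing the language capability.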
Problem

Research questions and friction points this paper is trying to address.

Addresses unstable optimization in monolithic MLLMs
Introduces Endogenous Visual Pre-training for visual knowledge
Reduces latency while maintaining multimodal performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds visual parameter space into pre-trained LLM
Uses multimodal mixture-of-experts structure
Implements Endogenous Visual Pre-training strategy
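The EViP idea listed above, freezing the LLM, updating only the embedded visual experts, and stepping from noisy toward high-quality data, can be sketched as a training schedule. Stage names, step counts, and learning rates below are illustrative placeholders, not the paper's actual recipe:

```python
# Hypothetical sketch of an EViP-style schedule: pre-trained LLM weights
# stay frozen throughout; only visual-expert parameters receive updates,
# over stages that progress from noisy to high-quality data.

LLM_PARAMS = {"attn.w": 0.5, "ffn.w": 0.3}   # pre-trained, frozen
visual_expert_params = {"vis_ffn.w": 0.0}    # newly embedded, trainable

def train_stage(steps, lr):
    """Update only the visual experts; verify the LLM stays untouched."""
    before = dict(LLM_PARAMS)
    for _ in range(steps):
        visual_expert_params["vis_ffn.w"] += lr  # placeholder gradient step
    assert LLM_PARAMS == before                  # frozen LLM: no drift
    return visual_expert_params["vis_ffn.w"]

# Progressive stages: noisy data first, high-quality data last
# (stage names are illustrative, not the paper's exact data splits).
for name, steps, lr in [("noisy", 3, 0.1), ("high-quality", 2, 0.01)]:
    train_stage(steps, lr)

print(round(visual_expert_params["vis_ffn.w"], 3))  # 0.32
```

The point of the sketch is the invariant checked inside `train_stage`: whatever the data stage, the pre-trained language parameters never move, which is how catastrophic forgetting is avoided.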