Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

📅 2024-12-06
📈 Citations: 6
Influential citations: 2
🤖 AI Summary
To address the performance limitations of open-source multimodal large language models (MLLMs) on complex vision-language understanding tasks, this work introduces the InternVL 2.5 series. Methodologically, it jointly scales the visual encoder and the language model, combined with multi-stage alignment training, high-quality cross-modal data curation, test-time Chain-of-Thought (CoT) reasoning, and ensemble re-ranking. The contributions are threefold: (1) it reports the first open-source MLLM result to exceed 70% on the MMMU benchmark (70.1%, including a 3.7-point gain from CoT reasoning), rivaling GPT-4o and Claude-3.5-Sonnet; (2) it systematically characterizes how performance scales with visual encoder capacity, language model size, data scale, and test-time strategies; and (3) it offers empirical evidence that test-time scaling is effective for open-source MLLMs. All models and an interactive Hugging Face demo are publicly released.
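
A minimal sketch of querying a released checkpoint through the `chat` interface documented on the OpenGVLab Hugging Face model cards; the model id, prompt, and single-tile 448×448 preprocessing (a simplification of the card's dynamic-tiling `load_image` helper) are illustrative assumptions, not the paper's exact pipeline:

```python
# Hedged sketch: query an InternVL 2.5 checkpoint via its Hugging Face `chat`
# interface. The single 448x448 tile below simplifies the model card's
# dynamic-tiling preprocessing; large images are normally split into tiles.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

PATH = "OpenGVLab/InternVL2_5-8B"  # one of the released model sizes
model = AutoModel.from_pretrained(
    PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # InternVL ships custom modeling code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(PATH, trust_remote_code=True, use_fast=False)

# One ImageNet-normalized 448x448 tile, shape (1, 3, 448, 448).
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# "<image>" marks where the visual tokens are spliced into the prompt.
question = "<image>\nDescribe the figure, then answer: what trend does it show?"
response = model.chat(
    tokenizer, pixel_values, question,
    dict(max_new_tokens=512, do_sample=False),  # greedy decoding
)
print(response)
```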

📝 Abstract
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. See the HuggingFace demo: https://huggingface.co/spaces/OpenGVLab/InternVL
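
The abstract's 3.7-point MMMU gain comes from spending more compute at inference time rather than from extra training. As a generic illustration of that test-time scaling axis (self-consistency voting over sampled CoT traces, not necessarily the paper's exact recipe), assuming the `chat` interface from the sketch above and a hypothetical "Answer: <letter>" output convention:

```python
# Generic self-consistency sketch of test-time scaling: sample several CoT
# traces and majority-vote the extracted answers. This illustrates the idea,
# not the paper's exact procedure; `model`/`tokenizer`/`pixel_values` follow
# the earlier sketch, and the answer format is a hypothetical convention.
import re
from collections import Counter

def extract_answer(trace: str) -> str:
    """Pull a final multiple-choice letter from a CoT trace (hypothetical format)."""
    match = re.search(r"Answer\s*:?\s*([A-D])\b", trace, flags=re.IGNORECASE)
    return match.group(1).upper() if match else trace.strip()[-1:].upper()

def cot_majority_vote(model, tokenizer, pixel_values, question: str, n: int = 8) -> str:
    prompt = question + "\nThink step by step, then end with 'Answer: <letter>'."
    votes = []
    for _ in range(n):  # each sample is an independent stochastic CoT trace
        trace = model.chat(
            tokenizer, pixel_values, prompt,
            dict(max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.9),
        )
        votes.append(extract_answer(trace))
    return Counter(votes).most_common(1)[0][0]  # most frequent answer wins
```

Raising `n` trades inference compute for accuracy, which is the test-time scaling axis the paper highlights alongside model and data scale.
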
❓ Problem

Research questions and friction points this paper is trying to address.

Multimodal Modeling
Visual Understanding
Text Understanding
💡 Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Modeling
Performance Enhancement
Artificial Intelligence Advancement
👥 Authors
Zhe Chen
Nanjing University, Shanghai AI Laboratory
Weiyun Wang
Shanghai AI Laboratory, Fudan University
Vision-Language Model · MLLM · Foundation Model
Yue Cao
Nanjing University, Shanghai AI Laboratory
Yangzhou Liu
Nanjing University, Shanghai AI Laboratory
Zhangwei Gao
Shanghai Jiao Tong University, Shanghai AI Laboratory
Erfei Cui
Shanghai AI Laboratory, Shanghai Jiao Tong University
Computer Vision
Jinguo Zhu
Shanghai AI Laboratory
Shenglong Ye
Shanghai AI Laboratory
Hao Tian
SenseTime Research
Zhaoyang Liu
Tongyi Lab, Alibaba Group
LLM · Recommendation
Lixin Gu
Shanghai AI Laboratory
Xuehui Wang
PhD Candidate, Shanghai Jiao Tong University
Computer Vision · Segmentation · Detection
Qingyun Li
University of Electronic Science and Technology of China
wireless communications · information theory
Yiming Ren
Tsinghua University
Object Detection · Multimodal Large Language Model
Zixuan Chen
SenseTime Research
Jiapeng Luo
SenseTime Research
Jiahao Wang
SenseTime Research
Tan Jiang
SenseTime Research
Bo Wang
SenseTime Research
Conghui He
Shanghai AI Laboratory
Data-centric AI · LLM · Document Intelligence
Botian Shi
Shanghai Artificial Intelligence Laboratory
VLMs · Document Understanding · Autonomous Driving
Xingcheng Zhang
Shanghai AI Laboratory
Han Lv
Shanghai AI Laboratory
Yi Wang
Shanghai AI Laboratory
Wenqi Shao
Researcher at Shanghai AI Laboratory
Foundation Model Evaluation · LLM Compression · Efficient Adaptation · Multimodal Learning
Pei Chu
Shanghai AI Laboratory
Zhongying Tu
Shanghai AI Laboratory
Tong He
Shanghai AI Laboratory
Zhiyong Wu
Shanghai AI Laboratory
Hui Deng
Shanghai AI Laboratory
Jiaye Ge
Shanghai AI Laboratory
Kaiming Chen
Shanghai AI Laboratory
Min Dou
Shanghai AI Laboratory
Autonomous Driving · MLLM · Embodied AI
Lewei Lu
Research Director, SenseTime Research
Computer Vision · Deep Learning
Xizhou Zhu
Tsinghua University
Tong Lu
Nanjing University, Shanghai AI Laboratory
Dahua Lin
The Chinese University of Hong Kong, Shanghai AI Laboratory
Yu Qiao
Shanghai AI Laboratory
Jifeng Dai
Associate Professor of EE, Tsinghua University
computer vision · deep learning
Wenhai Wang
The Chinese University of Hong Kong, Shanghai AI Laboratory