InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To close the performance gap between open-source multimodal models and state-of-the-art commercial models (e.g., GPT-5) in generality, reasoning capability, and inference efficiency, this work proposes a cascaded reinforcement learning framework that enables hierarchical "coarse-to-fine" reasoning optimization, integrating offline and online reinforcement learning. It further introduces a Visual Resolution Router (ViR) that dynamically adapts the resolution of visual tokens, and adopts Decoupled Vision-Language Deployment (DvD) alongside large-scale distributed inference techniques. On benchmarks including MMMU and MathVista, the method achieves up to a 16.0% improvement in reasoning performance and a 4.05× inference speedup over its predecessor, InternVL3; the largest variant attains state-of-the-art results across multiple open-source evaluations. This work substantially narrows the gap between open-source and proprietary multimodal models and extends support to emerging applications such as GUI interaction and embodied intelligence.

📝 Abstract
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
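As a rough illustration of the Visual Resolution Router idea described in the abstract, the sketch below keeps high-importance visual tokens at full resolution and merges consecutive low-importance tokens by averaging. The function name, the scalar router scores, and the pairwise-averaging rule are all illustrative assumptions for this toy; the paper's actual router is a learned module inside the model, not a fixed threshold rule.

```python
def compress_tokens(tokens, scores, threshold=0.5):
    """Toy ViR-style token compressor (illustrative, not the paper's method).

    tokens: list of feature vectors (lists of floats).
    scores: hypothetical per-token importance in [0, 1] from a router.
    Tokens scoring >= threshold are kept as-is; consecutive low-score
    tokens are averaged pairwise, roughly halving their count.
    """
    out = []
    pending = None  # a low-score token waiting for a merge partner
    for tok, score in zip(tokens, scores):
        if score >= threshold:
            if pending is not None:
                out.append(pending)  # flush an unpaired low-score token
                pending = None
            out.append(tok)
        elif pending is None:
            pending = tok
        else:
            # Merge two adjacent low-score tokens into their mean.
            out.append([(a + b) / 2 for a, b in zip(pending, tok)])
            pending = None
    if pending is not None:
        out.append(pending)
    return out

tokens = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [8.0, 8.0]]
scores = [0.9, 0.2, 0.3, 0.8]
print(compress_tokens(tokens, scores))  # [[1.0, 1.0], [3.0, 3.0], [8.0, 8.0]]
```

Here four tokens shrink to three: the two middle tokens fall below the threshold and collapse into their average, which is the "lower resolution without discarding content" intuition behind dynamic token routing.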
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal model versatility and reasoning capabilities
Improving inference efficiency through dynamic visual token resolution
Enabling GUI interaction and embodied agency in open-source models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascade Reinforcement Learning enhances reasoning
Visual Resolution Router adjusts token resolution
Decoupled Vision-Language Deployment balances GPU load
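The decoupled-deployment idea above can be sketched as a two-stage pipeline: a producer thread stands in for the vision-encoder GPUs and a consumer thread for the language-model GPUs, connected by a bounded queue so encoding of the next image overlaps with decoding of the current one. The worker functions, string placeholders, and queue size are illustrative assumptions, not the paper's serving implementation.

```python
import threading
import queue

def vision_encoder(images, q):
    """Producer: stand-in for the ViT stage running on its own device."""
    for img in images:
        q.put((img, f"features<{img}>"))  # placeholder for visual features
    q.put(None)  # sentinel: no more work

def language_model(q, results):
    """Consumer: stand-in for the LLM stage on separate devices."""
    while True:
        item = q.get()
        if item is None:
            break
        img, feats = item
        results[img] = f"answer({feats})"  # placeholder for decoding

images = ["img0", "img1", "img2"]
q = queue.Queue(maxsize=2)  # bounded queue keeps the two stages in balance
results = {}

t_vision = threading.Thread(target=vision_encoder, args=(images, q))
t_language = threading.Thread(target=language_model, args=(q, results))
t_vision.start()
t_language.start()
t_vision.join()
t_language.join()

print(results["img2"])  # answer(features<img2>)
```

The bounded queue is the key design choice: it lets the faster stage run ahead only a little, so neither set of devices sits idle waiting for the other, which is the load-balancing effect DvD targets.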
👥 Authors
Weiyun Wang (Shanghai AI Laboratory; Fudan University)
Zhangwei Gao (InternVL Team, Shanghai AI Laboratory)
Lixin Gu (InternVL Team, Shanghai AI Laboratory)
Hengjun Pu (InternVL Team, Shanghai AI Laboratory)
Long Cui (InternVL Team, Shanghai AI Laboratory)
Xingguang Wei (InternVL Team, Shanghai AI Laboratory)
Zhaoyang Liu (Tongyi Lab, Alibaba Group)
Linglin Jing (InternVL Team, Shanghai AI Laboratory)
Shenglong Ye (InternVL Team, Shanghai AI Laboratory)
Jie Shao (University of Electronic Science and Technology of China)
Zhaokai Wang (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Zhe Chen (InternVL Team, Shanghai AI Laboratory)
Hongjie Zhang (Nanjing University; Shanghai Artificial Intelligence Laboratory)
Ganlin Yang (University of Science and Technology of China; Shanghai AI Laboratory)
Haomin Wang (Shanghai AI Laboratory; Shanghai Jiao Tong University)
Qi Wei (George Mason University)
Jinhui Yin (InternVL Team, Shanghai AI Laboratory)
Wenhao Li (InternVL Team, Shanghai AI Laboratory)
Erfei Cui (Shanghai AI Laboratory; Shanghai Jiao Tong University)
Guanzhou Chen (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Zichen Ding (Shanghai AI Laboratory)
Changyao Tian (MMLab, CUHK)
Zhenyu Wu (InternVL Team, Shanghai AI Laboratory)
Jingjing Xie (InternVL Team, Shanghai AI Laboratory)
Zehao Li (Peking University)