InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To close the performance gap between open-source multimodal models and state-of-the-art commercial models (e.g., GPT-5) in generality, reasoning capability, and inference efficiency, this work proposes a cascaded reinforcement learning framework that enables hierarchical "coarse-to-fine" reasoning optimization, integrating offline and online reinforcement learning. It further introduces a Visual Resolution Router (ViR) that dynamically adapts the resolution of visual tokens, and adopts Decoupled Vision-Language Deployment (DvD) alongside large-scale distributed inference techniques. On benchmarks including MMMU and MathVista, the method achieves up to a 16.0% improvement in reasoning performance and a 4.05× inference speedup over its predecessor, InternVL3; the largest variant attains state-of-the-art results across multiple open-source evaluations. This work substantially narrows the gap between open-source and proprietary multimodal models and extends support to emerging applications such as GUI interaction and embodied intelligence.

📝 Abstract
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
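As a rough illustration of the Visual Resolution Router idea described in the abstract, the sketch below keeps high-importance visual tokens at full resolution and merges consecutive low-importance tokens by averaging. The function name, the scalar router scores, and the pairwise-averaging rule are all illustrative assumptions for this toy; the paper's actual router is a learned module inside the model, not a fixed threshold rule.

```python
def compress_tokens(tokens, scores, threshold=0.5):
    """Toy ViR-style token compressor (illustrative, not the paper's method).

    tokens: list of feature vectors (lists of floats).
    scores: hypothetical per-token importance in [0, 1] from a router.
    Tokens scoring >= threshold are kept as-is; consecutive low-score
    tokens are averaged pairwise, roughly halving their count.
    """
    out = []
    pending = None  # a low-score token waiting for a merge partner
    for tok, score in zip(tokens, scores):
        if score >= threshold:
            if pending is not None:
                out.append(pending)  # flush an unpaired low-score token
                pending = None
            out.append(tok)
        elif pending is None:
            pending = tok
        else:
            # Merge two adjacent low-score tokens into their mean.
            out.append([(a + b) / 2 for a, b in zip(pending, tok)])
            pending = None
    if pending is not None:
        out.append(pending)
    return out

tokens = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [8.0, 8.0]]
scores = [0.9, 0.2, 0.3, 0.8]
print(compress_tokens(tokens, scores))  # [[1.0, 1.0], [3.0, 3.0], [8.0, 8.0]]
```

Here four tokens shrink to three: the two middle tokens fall below the threshold and collapse into their average, which is the "lower resolution without discarding content" intuition behind dynamic token routing.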
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal model versatility and reasoning capabilities
Improving inference efficiency through dynamic visual token resolution
Enabling GUI interaction and embodied agency in open-source models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascade Reinforcement Learning enhances reasoning
Visual Resolution Router adjusts token resolution
Decoupled Vision-Language Deployment balances GPU load
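The decoupled-deployment idea above can be sketched as a two-stage pipeline: a producer thread stands in for the vision-encoder GPUs and a consumer thread for the language-model GPUs, connected by a bounded queue so encoding of the next image overlaps with decoding of the current one. The worker functions, string placeholders, and queue size are illustrative assumptions, not the paper's serving implementation.

```python
import threading
import queue

def vision_encoder(images, q):
    """Producer: stand-in for the ViT stage running on its own device."""
    for img in images:
        q.put((img, f"features<{img}>"))  # placeholder for visual features
    q.put(None)  # sentinel: no more work

def language_model(q, results):
    """Consumer: stand-in for the LLM stage on separate devices."""
    while True:
        item = q.get()
        if item is None:
            break
        img, feats = item
        results[img] = f"answer({feats})"  # placeholder for decoding

images = ["img0", "img1", "img2"]
q = queue.Queue(maxsize=2)  # bounded queue keeps the two stages in balance
results = {}

t_vision = threading.Thread(target=vision_encoder, args=(images, q))
t_language = threading.Thread(target=language_model, args=(q, results))
t_vision.start()
t_language.start()
t_vision.join()
t_language.join()

print(results["img2"])  # answer(features<img2>)
```

The bounded queue is the key design choice: it lets the faster stage run ahead only a little, so neither set of devices sits idle waiting for the other, which is the load-balancing effect DvD targets.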
👥 Authors
Weiyun Wang (Shanghai AI Laboratory; Fudan University)
Zhangwei Gao (InternVL Team, Shanghai AI Laboratory)
Lixin Gu (InternVL Team, Shanghai AI Laboratory)
Hengjun Pu (InternVL Team, Shanghai AI Laboratory)
Long Cui (InternVL Team, Shanghai AI Laboratory)
Xingguang Wei (InternVL Team, Shanghai AI Laboratory)
Zhaoyang Liu (Tongyi Lab, Alibaba Group)
Linglin Jing (InternVL Team, Shanghai AI Laboratory)
Shenglong Ye (InternVL Team, Shanghai AI Laboratory)
Jie Shao (University of Electronic Science and Technology of China)
Zhaokai Wang (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Zhe Chen (InternVL Team, Shanghai AI Laboratory)
Hongjie Zhang (Nanjing University; Shanghai Artificial Intelligence Laboratory)
Ganlin Yang (University of Science and Technology of China; Shanghai AI Laboratory)
Haomin Wang (Shanghai AI Laboratory; Shanghai Jiao Tong University)
Qi Wei (George Mason University)
Jinhui Yin (InternVL Team, Shanghai AI Laboratory)
Wenhao Li (InternVL Team, Shanghai AI Laboratory)
Erfei Cui (Shanghai AI Laboratory; Shanghai Jiao Tong University)
Guanzhou Chen (Shanghai Jiao Tong University; Shanghai AI Laboratory)
Zichen Ding (Shanghai AI Laboratory)
Changyao Tian (MMLab, CUHK)
Zhenyu Wu (InternVL Team, Shanghai AI Laboratory)
Jingjing Xie (InternVL Team, Shanghai AI Laboratory)
Zehao Li (Peking University)