Ovis2.5 Technical Report

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak native-resolution perception, limited reasoning capability, and low deployment efficiency of multimodal large language models (MLLMs) on high-resolution vision-intensive tasks—such as complex chart analysis, visual grounding, and video understanding—this paper introduces the Ovis2.5 series. Methodologically: (i) we design a native-resolution ViT backbone to preserve fine-grained image details; (ii) we propose a dual-path reasoning architecture integrating linear inference with reflective inference, enabling self-checking and dynamic revision; and (iii) we employ five-stage curriculum learning, multimodal data packing, hybrid parallel training, and alignment optimization via joint DPO and GRPO. Our key contribution is achieving state-of-the-art performance for compact models under resource constraints: Ovis2.5-9B and Ovis2.5-2B attain average scores of 78.3 and 73.9, respectively, on OpenCompass—ranking first among open-source models of comparable scale—and significantly outperform prior work on STEM reasoning, chart comprehension, and video understanding benchmarks.

📝 Abstract
We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
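The abstract credits part of the training speedup to multimodal data packing, i.e., concatenating variable-length samples into fixed-length sequences so batches carry less padding. The report's exact algorithm is not given here; the following is a minimal sketch of the common first-fit-decreasing variant, with `pack_sequences` and `max_len` as illustrative names.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing bin packing of per-sample token lengths.

    Returns a list of bins, each a list of sample indices whose combined
    length fits within max_len. Longer samples are placed first, which
    tends to leave fewer under-filled bins (less padding per batch).
    """
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, free = [], []  # sample indices per bin; remaining space per bin
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} exceeds max_len={max_len}")
        for b, space in enumerate(free):
            if n <= space:  # first bin with enough room
                bins[b].append(i)
                free[b] -= n
                break
        else:  # no existing bin fits: open a new one
            bins.append([i])
            free.append(max_len - n)
    return bins

# Example: five variable-length samples packed into 4096-token sequences.
packed = pack_sequences([1500, 3000, 800, 2500, 600], max_len=4096)
# packs 5 samples into 3 bins instead of 5 padded sequences
```

In practice each bin would be concatenated into one training sequence, with attention masks or position resets keeping samples from attending across packing boundaries.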
Problem

Research questions and friction points this paper is trying to address.

Enhancing native-resolution visual perception for dense content
Advancing multimodal reasoning beyond linear chain-of-thought
Optimizing model performance for resource-constrained deployment scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native-resolution vision transformer for detailed visual perception
Reflection-based reasoning with self-checking and revision
Five-phase curriculum training with DPO and GRPO
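The final curriculum phase applies DPO (alongside GRPO) for alignment. As a reference point, here is a minimal sketch of the standard DPO pairwise objective for a single preference pair; it is not the authors' implementation, and `dpo_loss` and its argument names are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and under a frozen reference model. The loss is
    -log sigmoid(beta * margin), where the margin measures how much more
    the policy prefers the chosen response than the reference does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference (margin = 0), loss = log 2 ≈ 0.693;
# it shrinks as the policy separates chosen from rejected responses.
```

GRPO differs in that it normalizes rewards within a group of sampled responses rather than using pairwise preferences; the report combines both for reasoning enhancement.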
Authors

Shiyin Lu · Alibaba Group · Multimodal Large Language Models, Online Learning, Bandits
Yang Li · Ovis Team, Alibaba Group
Yu Xia · Ovis Team, Alibaba Group
Yuwei Hu · Ovis Team, Alibaba Group
Shanshan Zhao · Ovis Team, Alibaba Group
Yanqing Ma · Ovis Team, Alibaba Group
Zhichao Wei · Ovis Team, Alibaba Group
Yinglun Li · Ovis Team, Alibaba Group
Lunhao Duan · Ovis Team, Alibaba Group
Jianshan Zhao · Ovis Team, Alibaba Group
Yuxuan Han · Tsinghua University · computer vision, computer graphics
Haijun Li · Washington State University · Probability, Risk Theory, Multivariate Extremes
Wanying Chen · Ovis Team, Alibaba Group
Junke Tang · Ovis Team, Alibaba Group
Chengkun Hou · Ovis Team, Alibaba Group
Zhixing Du · Ovis Team, Alibaba Group
Tianli Zhou · Ovis Team, Alibaba Group
Wenjie Zhang · Ovis Team, Alibaba Group
Huping Ding · Ovis Team, Alibaba Group
Jiahe Li · Ovis Team, Alibaba Group
Wen Li · Ovis Team, Alibaba Group
Gui Hu · Ovis Team, Alibaba Group
Yiliang Gu · Ovis Team, Alibaba Group
Siran Yang · Ovis Team, Alibaba Group
Jiamang Wang · Ovis Team, Alibaba Group