🤖 AI Summary
To address the limitations of existing vision-language models in fine-grained perception and complex reasoning, this paper introduces SAIL-VL2, an open-source multimodal foundation model supporting both image and video understanding. Methodologically, the authors (1) develop a high-quality image, text, and video data curation pipeline; (2) propose a progressive training framework that integrates visual encoder pretraining with chain-of-thought-enhanced supervised fine-tuning and reinforcement learning; and (3) adopt the SAIL-ViT visual backbone coupled with a sparse Mixture-of-Experts (MoE) architecture to improve parameter efficiency. Evaluated across 106 benchmarks, SAIL-VL2-2B achieves state-of-the-art performance on key reasoning tasks, including MMMU and MathVista, and ranks first on the OpenCompass leaderboard among open-source models under the 4B parameter scale. These results advance the development of open multimodal foundation models.
📝 Abstract
We introduce SAIL-VL2, an open-suite vision-language foundation model (VLM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and attains state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, serving as an efficient and extensible foundation for the open-source multimodal community.
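To make the sparse Mixture-of-Experts idea mentioned above concrete, here is a minimal, illustrative sketch of top-k expert routing in plain Python. This is not SAIL-VL2's actual implementation; the expert networks, router weights, and function names are all hypothetical toy stand-ins chosen to show why sparse routing improves parameter efficiency (only `top_k` of the experts are evaluated per token).

```python
# Illustrative sketch only: top-k sparse MoE routing with toy linear experts.
# All names and parameters here are hypothetical, not from the SAIL-VL2 paper.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(token, experts, gate_weights, top_k=2):
    """Route one token vector through its top_k highest-scoring experts.

    token:        list[float], the input hidden state
    experts:      list of callables, each mapping a vector to a vector
    gate_weights: list of per-expert router weight vectors
    """
    # Router: score each expert by a dot product with the token.
    scores = [sum(w * x for w, x in zip(gw, token)) for gw in gate_weights]
    # Sparsity: select the top_k experts; the rest are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize gate probabilities over the selected experts only.
    probs = softmax([scores[i] for i in top])
    # Output is the gate-weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for p, i in zip(probs, top):
        for d, v in enumerate(experts[i](token)):
            out[d] += p * v
    return out, top
```

With, say, 4 experts and `top_k=2`, each token pays the compute cost of only 2 expert forward passes while the layer's total parameter count spans all 4 experts, which is the efficiency trade-off sparse MoE designs exploit.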