🤖 AI Summary
To address the limitations of existing vision-language models in fine-grained perception and complex reasoning, this paper introduces SAIL-VL2, an open-source multimodal foundation model supporting both image and video understanding. Methodologically, the authors (1) develop a high-quality image, text, and video data curation pipeline; (2) propose a progressive training framework that integrates visual encoder pretraining with chain-of-thought-enhanced supervised fine-tuning and reinforcement learning; and (3) adopt the SAIL-ViT visual backbone coupled with a sparse Mixture-of-Experts (MoE) architecture to improve parameter efficiency. Evaluated across 106 benchmarks, SAIL-VL2-2B achieves state-of-the-art performance on key reasoning tasks, including MMMU and MathVista, and ranks first on the OpenCompass leaderboard among open-source models under the 4B parameter scale. These results advance the development of open multimodal foundation models.
📝 Abstract
We introduce SAIL-VL2, an open-suite vision-language foundation model (VLM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and attains state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, serving as an efficient and extensible foundation for the open-source multimodal community.
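To make the sparse Mixture-of-Experts idea mentioned above concrete, here is a minimal, illustrative sketch of top-k expert routing in plain Python. This is not SAIL-VL2's actual implementation; the expert networks, router weights, and function names are all hypothetical toy stand-ins chosen to show why sparse routing improves parameter efficiency (only `top_k` of the experts are evaluated per token).

```python
# Illustrative sketch only: top-k sparse MoE routing with toy linear experts.
# All names and parameters here are hypothetical, not from the SAIL-VL2 paper.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(token, experts, gate_weights, top_k=2):
    """Route one token vector through its top_k highest-scoring experts.

    token:        list[float], the input hidden state
    experts:      list of callables, each mapping a vector to a vector
    gate_weights: list of per-expert router weight vectors
    """
    # Router: score each expert by a dot product with the token.
    scores = [sum(w * x for w, x in zip(gw, token)) for gw in gate_weights]
    # Sparsity: select the top_k experts; the rest are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize gate probabilities over the selected experts only.
    probs = softmax([scores[i] for i in top])
    # Output is the gate-weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for p, i in zip(probs, top):
        for d, v in enumerate(experts[i](token)):
            out[d] += p * v
    return out, top
```

With, say, 4 experts and `top_k=2`, each token pays the compute cost of only 2 expert forward passes while the layer's total parameter count spans all 4 experts, which is the efficiency trade-off sparse MoE designs exploit.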