STEP3-VL-10B Technical Report

📅 2026-01-14
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of reconciling efficiency and performance in multimodal models at the 10-billion-parameter scale. The authors propose a unified, fully unfrozen vision–language pretraining architecture, pretrained on 1.2 trillion multimodal tokens and further refined through over 1,000 rounds of reinforcement learning. A novel test-time parallel coordinated reasoning mechanism (PaCoRe) is introduced to enable scalable synergy between perception and reasoning. Built upon Qwen3-8B, the resulting open-source model achieves state-of-the-art results—matching or surpassing significantly larger models and top proprietary systems such as Gemini 2.5 Pro—on multiple benchmarks, including MMBench (92.2%), MMMU (80.11%), AIME2025 (94.43%), and MathVision (75.95%).

📝 Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
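The abstract names Parallel Coordinated Reasoning (PaCoRe) only at a high level: spend test-time compute exploring diverse hypotheses in parallel, then synthesize them into one answer. As a rough illustration of that general pattern (not the paper's actual mechanism), here is a minimal sketch; the model stub, the thread-pool fan-out, and the majority-vote synthesis step are all assumptions for illustration.

```python
import concurrent.futures
from collections import Counter

def sample_hypothesis(model, prompt, seed):
    # Hypothetical single rollout; a real system would run one
    # stochastic decoding pass of the multimodal model here.
    return model(prompt, seed)

def parallel_coordinated_answer(model, prompt, n_rollouts=8):
    """Run n_rollouts independent reasoning passes in parallel and
    synthesize a final answer. Majority voting stands in for the
    (unspecified) PaCoRe synthesis step."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        futures = [pool.submit(sample_hypothesis, model, prompt, s)
                   for s in range(n_rollouts)]
        hypotheses = [f.result() for f in futures]
    answer, _ = Counter(hypotheses).most_common(1)[0]
    return answer

# Toy deterministic "model": most seeds agree on "42".
toy = lambda prompt, seed: "42" if seed % 4 else "41"
print(parallel_coordinated_answer(toy, "What is 6*7?"))  # → "42"
```

The point of the sketch is only the shape of the compute allocation: many cheap, diverse rollouts plus one aggregation step, rather than a single long serial chain of thought.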
Problem

Research questions and friction points this paper is trying to address.

multimodal intelligence, compact model, vision-language synergy, complex reasoning, efficient foundation model

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel Coordinated Reasoning, unified unfrozen pre-training, multimodal foundation model, reinforcement learning post-training, vision-language synergy
Authors

Ailin Huang (Multimodal Intelligence Team, StepFun)
Chengyuan Yao (Columbia University; Educational Data Science, Transfer Learning, Algorithmic Fairness)
Chunrui Han (Multimodal Intelligence Team, StepFun)
Fanqi Wan (Sun Yat-sen University; NLP, LLMs)
Hangyu Guo (Multimodal Intelligence Team, StepFun)
Haoran Lv (Multimodal Intelligence Team, StepFun)
Hongyu Zhou (Multimodal Intelligence Team, StepFun)
Jia Wang (Multimodal Intelligence Team, StepFun)
Jian Zhou (Multimodal Intelligence Team, StepFun)
Jian-Yuan Sun (Multimodal Intelligence Team, StepFun)
Jingcheng Hu (Tsinghua University; Reasoning Foundation Model, Multi-Agent Learning)
Kangheng Lin (Multimodal Intelligence Team, StepFun)
Liang Zhao (StepFun; MLLM, LLM)
Mitt Huang (Multimodal Intelligence Team, StepFun)
Song Yuan (Zhejiang University, CAGE; Development Economics, International Economics, Political Economy, Economic History)
Wenwen Qu (Multimodal Intelligence Team, StepFun)
Xiangfeng Wang (Multimodal Intelligence Team, StepFun)
Yanlin Lai (Multimodal Intelligence Team, StepFun)
Ying-Ying Zhao (Multimodal Intelligence Team, StepFun)
Yinmin Zhang (PhD Student, The University of Sydney; Large Language Model, Reinforcement Learning, Deep Learning, Computer Vision)
Yukang Shi (Multimodal Intelligence Team, StepFun)
Yuyang Chen (Multimodal Intelligence Team, StepFun)
Zejia Weng (Fudan University; computer vision, video understanding, multimodal learning)
Ziyang Meng (Multimodal Intelligence Team, StepFun)
Ang Li (Multimodal Intelligence Team, StepFun)
Aobo Kong (Nankai University; NLP, LLM)
Bo Dong (Multimodal Intelligence Team, StepFun)
C. Wan (Multimodal Intelligence Team, StepFun)
David Wang (Multimodal Intelligence Team, StepFun)
Di Qi (Purdue University; applied and computational mathematics)
Dingming Li (Multimodal Intelligence Team, StepFun)
En Yu (Multimodal Intelligence Team, StepFun)
Guopeng Li (Multimodal Intelligence Team, StepFun)
Haiquan Yin (Multimodal Intelligence Team, StepFun)
Han Zhou (Multimodal Intelligence Team, StepFun)
Hanshan Zhang (Multimodal Intelligence Team, StepFun)
Haolong Yan (Multimodal Intelligence Team, StepFun)
Hebin Zhou (Multimodal Intelligence Team, StepFun)
Hongbo Peng (Multimodal Intelligence Team, StepFun)
Jiaran Zhang (Multimodal Intelligence Team, StepFun)
Jiashu Lv (Multimodal Intelligence Team, StepFun)
Jiayi Fu (Nankai University)
Jie Cheng (Institute of Automation, Chinese Academy of Sciences; Reinforcement Learning)
Jie Zhou (Multimodal Intelligence Team, StepFun)
Jisheng Yin (Multimodal Intelligence Team, StepFun)
Jin Xie (Multimodal Intelligence Team, StepFun)
Jingwei Wu (Multimodal Intelligence Team, StepFun)
Jun Zhang (ByteDance; Speech Recognition, Acoustic Event Detection, BCI)
Junfeng Liu (Multimodal Intelligence Team, StepFun)
Kaijun Tan (Multimodal Intelligence Team, StepFun)
Kaiwen Yan (Multimodal Intelligence Team, StepFun)
Liangyu Chen (StepFun; video generation, low-level vision)
Lina Chen (Multimodal Intelligence Team, StepFun)
Mingliang Li (Tsinghua University; Computer Systems)
Qian Zhao (Multimodal Intelligence Team, StepFun)
Quan Sun (Multimodal Intelligence Team, StepFun)
Shaoliang Pang (Multimodal Intelligence Team, StepFun)
Shengjie Fan (Multimodal Intelligence Team, StepFun)
S. Shang (Multimodal Intelligence Team, StepFun)
Siyuan Zhang (Multimodal Intelligence Team, StepFun)
Tian You (Multimodal Intelligence Team, StepFun)
Wei Ji (Multimodal Intelligence Team, StepFun)
Wuxun Xie (Multimodal Intelligence Team, StepFun)
Xiaobo Yang (Multimodal Intelligence Team, StepFun)
Xiaojie Hou (Multimodal Intelligence Team, StepFun)
Xiao-Bo Jiao (Multimodal Intelligence Team, StepFun)
Xiaoxiao Ren (Multimodal Intelligence Team, StepFun)
Xiangwen Kong (Multimodal Intelligence Team, StepFun)
Xin Huang (Multimodal Intelligence Team, StepFun)
Xin Wu (Multimodal Intelligence Team, StepFun)
Xing Chen (Multimodal Intelligence Team, StepFun)
Xinran Wang (Multimodal Intelligence Team, StepFun)
Xue-Li Zhang (Multimodal Intelligence Team, StepFun)
Yana Wei (Multimodal Intelligence Team, StepFun)
Yang Li (Multimodal Intelligence Team, StepFun)
Yanming Xu (Multimodal Intelligence Team, StepFun)
Yeqing Shen (Multimodal Intelligence Team, StepFun)
Yuang Peng (Tsinghua University; Generative Models, Multimodal Learning)
Yue Peng (University of Science and Technology of China; geometry optimization, physical simulation)
Yu Zhou (StepFun; SDN, NFV)
Yusheng Li (Multimodal Intelligence Team, StepFun)
Yuxiang Yang (Multimodal Intelligence Team, StepFun)
Yuyang Zhang (Graduate Student, Harvard University; Reinforcement Learning, Control Theory)
Zhe Xie (Tsinghua University, Shanghai Jiao Tong University; Anomaly Detection, Time Series, AIOps, LLM)
Zhewei Huang (Multimodal Intelligence Team, StepFun)
Zhenzhi Lu (Multimodal Intelligence Team, StepFun)
Zhimin Fan (Multimodal Intelligence Team, StepFun)
Zihui Cheng (Multimodal Intelligence Team, StepFun)
Daxin Jiang (Co-Founder & CEO, StepFun; Deep Learning, Foundation Models)
Qi Han (StepFun; Vision Foundation Models, Large Language Models)
Xiangyun Zhang (Multimodal Intelligence Team, StepFun)
Yibo Zhu (StepFun; Machine Learning Systems, Computer Networks, Distributed Systems)
Zheng Ge (Senior Researcher, StepFun; Multimodal Models, Perception and Reasoning)