STEP3-VL-10B Technical Report

📅 2026-01-14
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of reconciling efficiency and performance in multimodal models at the 10-billion-parameter scale. The authors propose a unified, fully unfrozen vision–language pretraining architecture, pretrained on 1.2 trillion multimodal tokens and further refined through over 1,000 rounds of reinforcement learning. A novel test-time parallel coordinated reasoning mechanism (PaCoRe) is introduced to enable scalable synergy between perception and reasoning. Built upon Qwen3-8B, the resulting open-source model achieves state-of-the-art results—matching or surpassing significantly larger models and top proprietary systems such as Gemini 2.5 Pro—on multiple benchmarks, including MMBench (92.2%), MMMU (80.11%), AIME2025 (94.43%), and MathVision (75.95%).

📝 Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
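The abstract names Parallel Coordinated Reasoning (PaCoRe) only at a high level: spend test-time compute exploring diverse hypotheses in parallel, then synthesize them into one answer. As a rough illustration of that general pattern (not the paper's actual mechanism), here is a minimal sketch; the model stub, the thread-pool fan-out, and the majority-vote synthesis step are all assumptions for illustration.

```python
import concurrent.futures
from collections import Counter

def sample_hypothesis(model, prompt, seed):
    # Hypothetical single rollout; a real system would run one
    # stochastic decoding pass of the multimodal model here.
    return model(prompt, seed)

def parallel_coordinated_answer(model, prompt, n_rollouts=8):
    """Run n_rollouts independent reasoning passes in parallel and
    synthesize a final answer. Majority voting stands in for the
    (unspecified) PaCoRe synthesis step."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        futures = [pool.submit(sample_hypothesis, model, prompt, s)
                   for s in range(n_rollouts)]
        hypotheses = [f.result() for f in futures]
    answer, _ = Counter(hypotheses).most_common(1)[0]
    return answer

# Toy deterministic "model": most seeds agree on "42".
toy = lambda prompt, seed: "42" if seed % 4 else "41"
print(parallel_coordinated_answer(toy, "What is 6*7?"))  # → "42"
```

The point of the sketch is only the shape of the compute allocation: many cheap, diverse rollouts plus one aggregation step, rather than a single long serial chain of thought.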
Problem

Research questions and friction points this paper is trying to address.

multimodal intelligence, compact model, vision-language synergy, complex reasoning, efficient foundation model

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel Coordinated Reasoning, unified unfrozen pre-training, multimodal foundation model, reinforcement learning post-training, vision-language synergy
Authors

Ailin Huang (Multimodal Intelligence Team, StepFun)
Chengyuan Yao (Columbia University; Educational Data Science, Transfer Learning, Algorithmic Fairness)
Chunrui Han (Multimodal Intelligence Team, StepFun)
Fanqi Wan (Sun Yat-sen University; NLP, LLMs)
Hangyu Guo (Multimodal Intelligence Team, StepFun)
Haoran Lv (Multimodal Intelligence Team, StepFun)
Hongyu Zhou (Multimodal Intelligence Team, StepFun)
Jia Wang (Multimodal Intelligence Team, StepFun)
Jian Zhou (Multimodal Intelligence Team, StepFun)
Jian-Yuan Sun (Multimodal Intelligence Team, StepFun)
Jingcheng Hu (Tsinghua University; Reasoning Foundation Model, Multi-Agent Learning)
Kangheng Lin (Multimodal Intelligence Team, StepFun)
Liang Zhao (StepFun; MLLM, LLM)
Mitt Huang (Multimodal Intelligence Team, StepFun)
Song Yuan (Zhejiang University, CAGE; Development Economics, International Economics, Political Economy, Economic History)
Wenwen Qu (Multimodal Intelligence Team, StepFun)
Xiangfeng Wang (Multimodal Intelligence Team, StepFun)
Yanlin Lai (Multimodal Intelligence Team, StepFun)
Ying-Ying Zhao (Multimodal Intelligence Team, StepFun)
Yinmin Zhang (PhD Student, The University of Sydney; Large Language Model, Reinforcement Learning, Deep Learning, Computer Vision)
Yukang Shi (Multimodal Intelligence Team, StepFun)
Yuyang Chen (Multimodal Intelligence Team, StepFun)
Zejia Weng (Fudan University; computer vision, video understanding, multimodal learning)
Ziyang Meng (Multimodal Intelligence Team, StepFun)
Ang Li (Multimodal Intelligence Team, StepFun)
Aobo Kong (Nankai University; NLP, LLM)
Bo Dong (Multimodal Intelligence Team, StepFun)
C. Wan (Multimodal Intelligence Team, StepFun)
David Wang (Multimodal Intelligence Team, StepFun)
Di Qi (Purdue University; applied and computational mathematics)
Dingming Li (Multimodal Intelligence Team, StepFun)
En Yu (Multimodal Intelligence Team, StepFun)
Guopeng Li (Multimodal Intelligence Team, StepFun)
Haiquan Yin (Multimodal Intelligence Team, StepFun)
Han Zhou (Multimodal Intelligence Team, StepFun)
Hanshan Zhang (Multimodal Intelligence Team, StepFun)
Haolong Yan (Multimodal Intelligence Team, StepFun)
Hebin Zhou (Multimodal Intelligence Team, StepFun)
Hongbo Peng (Multimodal Intelligence Team, StepFun)
Jiaran Zhang (Multimodal Intelligence Team, StepFun)
Jiashu Lv (Multimodal Intelligence Team, StepFun)
Jiayi Fu (Nankai University)
Jie Cheng (Institute of Automation, Chinese Academy of Sciences; Reinforcement Learning)
Jie Zhou (Multimodal Intelligence Team, StepFun)
Jisheng Yin (Multimodal Intelligence Team, StepFun)
Jin Xie (Multimodal Intelligence Team, StepFun)
Jingwei Wu (Multimodal Intelligence Team, StepFun)
Jun Zhang (ByteDance; Speech Recognition, Acoustic Event Detection, BCI)
Junfeng Liu (Multimodal Intelligence Team, StepFun)
Kaijun Tan (Multimodal Intelligence Team, StepFun)
Kaiwen Yan (Multimodal Intelligence Team, StepFun)
Liangyu Chen (StepFun; video generation, low-level vision)
Lina Chen (Multimodal Intelligence Team, StepFun)
Mingliang Li (Tsinghua University; Computer Systems)
Qian Zhao (Multimodal Intelligence Team, StepFun)
Quan Sun (Multimodal Intelligence Team, StepFun)
Shaoliang Pang (Multimodal Intelligence Team, StepFun)
Shengjie Fan (Multimodal Intelligence Team, StepFun)
S. Shang (Multimodal Intelligence Team, StepFun)
Siyuan Zhang (Multimodal Intelligence Team, StepFun)
Tian You (Multimodal Intelligence Team, StepFun)
Wei Ji (Multimodal Intelligence Team, StepFun)
Wuxun Xie (Multimodal Intelligence Team, StepFun)
Xiaobo Yang (Multimodal Intelligence Team, StepFun)
Xiaojie Hou (Multimodal Intelligence Team, StepFun)
Xiao-Bo Jiao (Multimodal Intelligence Team, StepFun)
Xiaoxiao Ren (Multimodal Intelligence Team, StepFun)
Xiangwen Kong (Multimodal Intelligence Team, StepFun)
Xin Huang (Multimodal Intelligence Team, StepFun)
Xin Wu (Multimodal Intelligence Team, StepFun)
Xing Chen (Multimodal Intelligence Team, StepFun)
Xinran Wang (Multimodal Intelligence Team, StepFun)
Xue-Li Zhang (Multimodal Intelligence Team, StepFun)
Yana Wei (Multimodal Intelligence Team, StepFun)
Yang Li (Multimodal Intelligence Team, StepFun)
Yanming Xu (Multimodal Intelligence Team, StepFun)
Yeqing Shen (Multimodal Intelligence Team, StepFun)
Yuang Peng (Tsinghua University; Generative Models, Multimodal Learning)
Yue Peng (University of Science and Technology of China; geometry optimization, physical simulation)
Yu Zhou (StepFun; SDN, NFV)
Yusheng Li (Multimodal Intelligence Team, StepFun)
Yuxiang Yang (Multimodal Intelligence Team, StepFun)
Yuyang Zhang (Graduate Student, Harvard University; Reinforcement Learning, Control Theory)
Zhe Xie (Tsinghua University, Shanghai Jiao Tong University; Anomaly Detection, Time Series, AIOps, LLM)
Zhewei Huang (Multimodal Intelligence Team, StepFun)
Zhenzhi Lu (Multimodal Intelligence Team, StepFun)
Zhimin Fan (Multimodal Intelligence Team, StepFun)
Zihui Cheng (Multimodal Intelligence Team, StepFun)
Daxin Jiang (Co-Founder & CEO, StepFun; Deep Learning, Foundation Models)
Qi Han (StepFun; Vision Foundation Models, Large Language Models)
Xiangyun Zhang (Multimodal Intelligence Team, StepFun)
Yibo Zhu (StepFun; Machine Learning Systems, Computer Networks, Distributed Systems)
Zheng Ge (Senior Researcher, StepFun; Multimodal Models, Perception and Reasoning)