Skywork-R1V3 Technical Report

📅 2025-07-08
🤖 AI Summary
This work addresses the challenge of efficiently transferring the strong reasoning capabilities of text-only large language models (LLMs) to vision-language tasks while achieving high-fidelity cross-modal alignment and multimodal reasoning. To this end, we introduce Skywork-R1V3, the first open-source vision-language model trained via reinforcement learning-based post-training (RLPT), enabling effective activation and transfer of textual reasoning abilities without additional pretraining. Key contributions include: (i) identifying the critical role of the connector module in cross-modal alignment; (ii) proposing an interpretable evaluation metric based on the entropy of critical reasoning tokens; and (iii) integrating curriculum learning with reinforcement fine-tuning. On the MMMU benchmark, the 38B-parameter model improves accuracy from 64.3% to 76.0%, reaching entry-level human performance, and generalizes strongly to discipline-specific reasoning tasks, particularly mathematics, matching state-of-the-art closed-source models.

📝 Abstract
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only large language models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our carefully designed post-training RL framework, which effectively activates and enhances the model's reasoning ability without any additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving significantly from 64.3% to 76.0%, a level that matches entry-level human performance. Remarkably, our RL-powered post-training approach enables even the 38B-parameter model to rival top closed-source VLMs, and it successfully transfers mathematical reasoning to other subject-specific reasoning tasks. We also include an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion of multimodal reasoning. Skywork-R1V3 represents a significant step forward in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
Problem

Research questions and friction points this paper is trying to address.

Transferring text-based reasoning to visual tasks effectively
Enhancing cross-modal alignment via connector module optimization
Improving checkpoint selection using reasoning token entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transfers text LLM reasoning to visual tasks
Uses RL post-training to enhance reasoning
Introduces entropy metric for checkpoint selection
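The entropy indicator above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the function names and toy probability distributions are assumptions, and the paper's criterion for identifying which tokens count as "critical reasoning tokens" is not reproduced here. The idea is to score a checkpoint by the average predictive entropy over those token positions and use that score as a selection signal during RL training.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of a single token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def critical_token_entropy(step_probs, critical_positions):
    """Mean entropy over positions flagged as critical reasoning tokens.

    step_probs: per-token probability distributions produced by a checkpoint.
    critical_positions: indices of the critical reasoning tokens; how these
    are identified in the paper is not reproduced here (assumption).
    """
    ents = [token_entropy(step_probs[i]) for i in critical_positions]
    return sum(ents) / len(ents)

# Toy comparison of two hypothetical checkpoints on the same prompt:
# checkpoint B is more confident (lower entropy) on the critical tokens.
ckpt_a = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]
ckpt_b = [[0.98, 0.01, 0.01], [0.9, 0.05, 0.05]]
print(critical_token_entropy(ckpt_a, [0, 1]))
print(critical_token_entropy(ckpt_b, [0, 1]))
```

In this sketch, candidate checkpoints saved during RL training would each be scored this way on a fixed evaluation set, and the score would inform which checkpoint to keep.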
Wei Shen
Skywork AI, Kunlun Inc
Jiangbo Pei
Skywork AI, Kunlun Inc
Yi Peng
Bytedance
Xuchen Song
CTO @ Mureka.ai | Head of Multimodality & Spatial AI @ Skywork.ai
Yang Liu
Skywork AI, Kunlun Inc
Jian Peng
Skywork AI, Kunlun Inc
Haofeng Sun
Skywork AI, Kunlun Inc
Yunzhuo Hao
CS PhD Student @ Zhejiang University
Peiyu Wang
Skywork AI, Kunlun Inc
Yahui Zhou
Skywork AI, Kunlun Inc