🤖 AI Summary
This work addresses the challenge of efficiently transferring the strong reasoning capabilities of text-only large language models (LLMs) to vision-language tasks while achieving high-fidelity cross-modal alignment and multimodal reasoning. To this end, we introduce Skywork-R1V3, the first open-source vision-language model trained via reinforcement learning–based post-training (RLPT), enabling effective activation and transfer of textual reasoning abilities without additional pretraining. Key contributions include: (i) identifying the critical role of connector modules in cross-modal alignment; (ii) proposing an interpretable evaluation metric based on the entropy of critical reasoning tokens; and (iii) integrating curriculum learning with reinforcement fine-tuning. On the MMMU benchmark, the 38B-parameter model improves accuracy from 64.3% to 76.0%, reaching entry-level human performance, and generalizes strongly to discipline-specific reasoning tasks, particularly mathematics, matching state-of-the-art closed-source models.
📝 Abstract
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our carefully designed post-training RL framework, which effectively activates and enhances the model's reasoning ability without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which proves highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving significantly from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B-parameter model to rival top closed-source VLMs, and it successfully transfers mathematical reasoning to reasoning tasks in other subjects. We also include an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
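The checkpoint-selection indicator above, the entropy of critical reasoning tokens, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `is_critical` flagging of connective tokens and the averaging scheme are assumptions for demonstration; the abstract does not specify how critical tokens are identified.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_critical_entropy(step_probs, critical_mask):
    """Average entropy over positions flagged as critical reasoning tokens.

    step_probs: list of per-position next-token distributions.
    critical_mask: which positions count as "critical" (e.g. reasoning
    connectives like 'therefore'); this flagging is a placeholder
    heuristic, not the paper's actual criterion.
    """
    ents = [token_entropy(p) for p, flag in zip(step_probs, critical_mask) if flag]
    return sum(ents) / len(ents) if ents else 0.0

# Toy example: two positions, only the second marked critical.
# A uniform distribution over 4 tokens has entropy ln 4 ≈ 1.3863 nats.
probs = [[0.7, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
mask = [False, True]
print(round(mean_critical_entropy(probs, mask), 4))  # → 1.3863
```

A lower average entropy at critical positions would indicate the policy has become confident at the decision points of its reasoning chain, which is one plausible reading of why such a metric helps rank RL checkpoints.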