🤖 AI Summary
This work addresses the challenge of efficiently transferring the strong reasoning capabilities of text-only large language models (LLMs) to vision-language tasks while achieving high-fidelity cross-modal alignment and multimodal reasoning. To this end, we introduce Skywork-R1V3, the first open-source vision-language model trained via reinforcement learning–based post-training (RLPT), enabling effective activation and transfer of textual reasoning abilities without additional pretraining. Key contributions include: (i) identifying the critical role of connector modules in cross-modal alignment; (ii) proposing an interpretable evaluation metric based on the entropy of critical reasoning tokens; and (iii) integrating curriculum learning with reinforcement fine-tuning. On the MMMU benchmark, the 38B-parameter model improves accuracy from 64.3% to 76.0%, reaching entry-level human performance, and generalizes strongly to discipline-specific reasoning tasks, particularly mathematics, matching state-of-the-art closed-source models.
📝 Abstract
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our carefully designed post-training RL framework, which effectively activates and enhances the model's reasoning ability without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which proves highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving significantly from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B-parameter model to rival top closed-source VLMs, and it successfully transfers mathematical reasoning to reasoning tasks in other subjects. We also include an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
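The checkpoint-selection indicator above, the entropy of critical reasoning tokens, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `is_critical` flagging of connective tokens and the averaging scheme are assumptions for demonstration; the abstract does not specify how critical tokens are identified.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_critical_entropy(step_probs, critical_mask):
    """Average entropy over positions flagged as critical reasoning tokens.

    step_probs: list of per-position next-token distributions.
    critical_mask: which positions count as "critical" (e.g. reasoning
    connectives like 'therefore'); this flagging is a placeholder
    heuristic, not the paper's actual criterion.
    """
    ents = [token_entropy(p) for p, flag in zip(step_probs, critical_mask) if flag]
    return sum(ents) / len(ents) if ents else 0.0

# Toy example: two positions, only the second marked critical.
# A uniform distribution over 4 tokens has entropy ln 4 ≈ 1.3863 nats.
probs = [[0.7, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
mask = [False, True]
print(round(mean_critical_entropy(probs, mask), 4))  # → 1.3863
```

A lower average entropy at critical positions would indicate the policy has become confident at the decision points of its reasoning chain, which is one plausible reading of why such a metric helps rank RL checkpoints.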