Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

📅 2025-04-08
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently scaling large language models (LLMs) to multimodal vision-language understanding while preserving strong textual reasoning capabilities. We propose a lightweight multimodal reasoning framework that avoids retraining the language or vision backbones, instead leveraging a learnable visual projector for joint text-image reasoning. To enhance cross-modal alignment, we integrate supervised fine-tuning (SFT) with group-relative policy optimization (GRPO). Furthermore, we introduce adaptive-length chain-of-thought distillation, dynamically optimizing reasoning chain length to balance inference efficiency and accuracy. Evaluated on benchmark suites, our 38B-parameter model achieves 69.0 on MMMU and 67.5 on MathVista, while maintaining state-of-the-art textual reasoning performance (72.0 on AIME and 94.0 on MATH500). All model weights are publicly released.

📝 Abstract
We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
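A "lightweight visual projector" of the kind the abstract describes is commonly a small MLP that maps frozen vision-encoder features into the LLM's token-embedding space, so only the projector is trained. A minimal sketch of that idea follows; all dimensions, names, and the choice of a two-layer ReLU MLP are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
VISION_DIM, HIDDEN, LLM_DIM = 1024, 2048, 4096

# Two-layer MLP projector: vision features -> LLM embedding space.
# In the framework described, only these weights would be trained;
# the vision encoder and LLM backbones stay frozen.
W1 = rng.normal(scale=0.02, size=(VISION_DIM, HIDDEN))
W2 = rng.normal(scale=0.02, size=(HIDDEN, LLM_DIM))

def project(vision_feats: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) features to (num_patches, LLM_DIM) tokens."""
    h = np.maximum(vision_feats @ W1, 0.0)  # simple ReLU nonlinearity
    return h @ W2

patches = rng.normal(size=(256, VISION_DIM))  # e.g. 256 image-patch features
tokens = project(patches)
print(tokens.shape)  # (256, 4096)
```

The projected tokens would then be concatenated with text-token embeddings and fed to the frozen LLM, which is what makes the adaptation cheap relative to full multimodal retraining.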
Problem

Research questions and friction points this paper is trying to address.

Extends LLM to visual modalities efficiently
Enhances visual-text alignment via hybrid optimization
Improves reasoning efficiency with adaptive-length CoT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight visual projector enables multimodal adaptation
Hybrid optimization enhances visual-text alignment
Adaptive-length Chain-of-Thought distillation improves reasoning
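The GRPO component of the hybrid optimization scores each sampled response relative to the other responses in its group, rather than against a learned value function. A minimal sketch of that group-relative advantage computation (illustrative only; reward values and the epsilon constant are assumptions):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: normalize each reward by its group's mean and std.

    Responses better than the group average get positive advantage,
    worse ones negative; no critic network is needed.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) / 0.0 (incorrect):
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # roughly [1.0, -1.0, -1.0, 1.0]
```

These per-response advantages would then weight the policy-gradient update for the corresponding sampled tokens, which is how correct responses are reinforced relative to their group.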