🤖 AI Summary
This work addresses the challenge of efficiently extending large language models (LLMs) to multimodal vision-language understanding while preserving strong textual reasoning capabilities. We propose a lightweight multimodal reasoning framework that avoids retraining the language or vision backbones, instead leveraging a learnable visual projector for joint text-image reasoning. To enhance cross-modal alignment, we combine supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Furthermore, we introduce adaptive-length chain-of-thought distillation, which dynamically optimizes reasoning chain length to balance inference efficiency and accuracy. On standard benchmarks, our 38B-parameter model achieves 69.0 on MMMU and 67.5 on MathVista, while maintaining strong textual reasoning performance (72.0 on AIME and 94.0 on MATH500). All model weights are publicly released.
📝 Abstract
We introduce Skywork R1V, a multimodal reasoning model that extends an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V enables seamless multimodal adaptation without retraining either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby improving inference efficiency and mitigating overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by strong scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
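The lightweight visual projector mentioned above can be pictured as a small trainable MLP that maps frozen vision-encoder features into the frozen LLM's embedding space, so that only the projector's weights are updated during transfer. The sketch below is a minimal illustration under assumed dimensions (the actual projector architecture and widths in Skywork R1V may differ):

```python
import numpy as np

# Assumed, illustrative dimensions -- not Skywork R1V's actual sizes.
VISION_DIM = 1024   # width of frozen vision-encoder features
LLM_DIM = 4096      # hidden width of the frozen language model

rng = np.random.default_rng(0)

class VisualProjector:
    """Minimal two-layer MLP projector: maps vision-encoder patch features
    into the LLM embedding space. In the transfer setup described above,
    only these weights would be trained; both backbones stay frozen."""

    def __init__(self, vision_dim, llm_dim, hidden_dim=2048):
        self.w1 = rng.standard_normal((vision_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    def __call__(self, vision_feats):
        # vision_feats: (num_patches, vision_dim) from the frozen encoder
        h = np.maximum(vision_feats @ self.w1 + self.b1, 0.0)  # ReLU-style nonlinearity
        return h @ self.w2 + self.b2  # (num_patches, llm_dim) visual "tokens"

projector = VisualProjector(VISION_DIM, LLM_DIM)
patches = rng.standard_normal((256, VISION_DIM))  # e.g. 256 image patches
visual_tokens = projector(patches)
print(visual_tokens.shape)  # (256, 4096)
```

The projected visual tokens would then be concatenated with text-token embeddings and fed to the unchanged LLM, which is what makes the adaptation cheap relative to full multimodal retraining.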
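The adaptive-length Chain-of-Thought distillation idea can be illustrated as selecting, among candidate reasoning chains for a training problem, one that trades off correctness against length. The scoring rule below (`correct - lam * tokens`) is a hypothetical stand-in for the paper's actual criterion, shown only to make the length/accuracy trade-off concrete:

```python
def select_chain(chains, lam=0.001):
    """Pick the candidate reasoning chain with the best correctness/length
    trade-off. `chains` is a list of (text, num_tokens, is_correct) tuples;
    the linear penalty `lam` is an illustrative assumption."""
    def score(chain):
        _, num_tokens, is_correct = chain
        return (1.0 if is_correct else 0.0) - lam * num_tokens
    return max(chains, key=score)

candidates = [
    ("short but wrong chain", 120, False),
    ("concise correct chain", 300, True),
    ("verbose correct chain", 2400, True),
]
best = select_chain(candidates)
print(best[0])  # -> concise correct chain
```

Distilling from chains chosen this way favors answers that are both correct and economical, which is one way to curb the "overthinking" behavior the abstract describes.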