🤖 AI Summary
This work addresses the fundamental trade-off between strong reasoning capability and broad generalization in multimodal large language models (MLLMs). To resolve it, we propose a hybrid reinforcement learning paradigm that integrates reward-model guidance with rule-based policy coordination. We further design a Selective Sample Buffer (SSB) mechanism to mitigate advantage collapse in GRPO, and introduce a visual hallucination monitoring module coupled with dynamic threshold calibration. Through multimodal joint training and fine-grained reward modeling, our approach achieves state-of-the-art performance among open-source models on OlympiadBench (62.6), AIME2024 (79.0), LiveCodeBench (63.6), and MMMU (74.0), approaching the capabilities of Gemini 2.5 and o4-mini. The method establishes a scalable, robust, and efficient technical pathway for complex multimodal reasoning.
📝 Abstract
We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds during training. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading scores of 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.
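To make the "Vanishing Advantages" problem concrete: in GRPO, advantages are computed relative to a group of sampled responses, so when every response in a group receives the same reward, the normalized advantages are all zero and the group contributes no gradient signal. The sketch below illustrates this failure mode and one plausible buffering remedy in the spirit of SSB. It is a minimal illustration, not the paper's implementation: the class name `SelectiveSampleBuffer`, the `threshold` and `capacity` parameters, and the replay policy are all assumptions made for clarity.

```python
def group_advantages(rewards):
    """GRPO-style group-relative advantages: reward minus the group mean,
    normalized by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        # All rewards identical -> "Vanishing Advantages": zero signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical sketch of an SSB-like mechanism: cache samples whose
    |advantage| exceeds a threshold, so high-value samples can be replayed
    when fresh groups yield mostly zero advantages."""

    def __init__(self, capacity=256, threshold=0.5):
        self.capacity = capacity
        self.threshold = threshold
        self.buffer = []  # list of (sample, advantage) pairs

    def add(self, samples, advantages):
        for s, a in zip(samples, advantages):
            if abs(a) >= self.threshold:
                self.buffer.append((s, a))
        # Keep only the highest-value samples, bounded by capacity.
        self.buffer.sort(key=lambda x: abs(x[1]), reverse=True)
        self.buffer = self.buffer[: self.capacity]

    def replay(self, k):
        """Return up to k cached high-value samples to mix into a batch."""
        return [s for s, _ in self.buffer[:k]]
```

Under this sketch, a training loop would call `group_advantages` per sampled group, feed nonzero-advantage samples into the buffer, and draw from `replay` whenever a batch's groups collapse to uniform rewards, keeping the effective gradient signal alive.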