Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

📅 2025-04-23
🤖 AI Summary
This work addresses the fundamental trade-off between strong reasoning capability and broad generalization in multimodal large language models (MLLMs). To resolve it, the authors propose a hybrid reinforcement learning paradigm that integrates reward-model guidance with rule-based strategies. They further design a Selective Sample Buffer (SSB) mechanism to mitigate the "Vanishing Advantages" problem in Group Relative Policy Optimization (GRPO), and introduce a visual hallucination monitoring module coupled with dynamic threshold calibration. Through multimodal joint training and fine-grained reward modeling, the approach achieves state-of-the-art performance among open-source models on OlympiadBench (62.6), AIME2024 (79.0), LiveCodeBench (63.6), and MMMU (74.0), approaching the capabilities of Gemini 2.5 and OpenAI o4-mini. The method establishes a scalable, robust, and efficient technical pathway for complex multimodal reasoning.

📝 Abstract
We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which counters the "Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds during training. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.
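The "Vanishing Advantages" dilemma the abstract describes arises because GRPO normalizes each response's reward against its group: when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal. A minimal Python sketch of this effect, and of a buffer that prioritizes high-value samples in the spirit of SSB, is below; the function names and the buffer-ranking policy are illustrative assumptions, not the paper's actual implementation.

```python
def group_advantages(rewards):
    """Group-relative advantages as in GRPO: each response's reward
    minus the group mean, scaled by the group's std deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        # All rewards identical: every advantage is zero, so this
        # group yields no gradient ("Vanishing Advantages").
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def select_with_buffer(groups, buffer, buffer_size=64):
    """Sketch of a selective sample buffer: pass informative groups
    through, cache them, and substitute a cached high-|advantage|
    group whenever a fresh group is degenerate."""
    batch = []
    for rewards in groups:
        advs = group_advantages(rewards)
        if any(a != 0.0 for a in advs):
            batch.append((rewards, advs))
            buffer.append((rewards, advs))  # cache informative groups
        elif buffer:
            # degenerate group: reuse the most informative cached one
            batch.append(max(buffer, key=lambda g: max(abs(a) for a in g[1])))
    # bound the buffer, keeping the highest-|advantage| groups
    buffer.sort(key=lambda g: max(abs(a) for a in g[1]), reverse=True)
    del buffer[buffer_size:]
    return batch
```

For example, a group with rewards `[1, 1, 1]` produces all-zero advantages and would be replaced from the buffer, while `[0, 1]` yields advantages `[-1, 1]` and is both trained on and cached.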
Problem

Research questions and friction points this paper is trying to address.

How to balance sophisticated reasoning capability with broad generalization in MLLMs
How to overcome the Vanishing Advantages problem in GRPO optimization
How to mitigate visual hallucinations induced by excessive reinforcement signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid reinforcement learning balances reasoning and generalization
Selective Sample Buffer counters vanishing advantages
Calibrated reward thresholds mitigate visual hallucinations
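The last innovation, calibrating reward thresholds to curb hallucinations, can be sketched as a gate on the reinforcement signal: when a hallucination monitor's score exceeds a calibrated threshold, positive reward is withheld so ungrounded responses are not reinforced. Everything here, the function, the scoring interface, and the quantile-based threshold update, is a hypothetical illustration under assumed names, not the paper's method.

```python
def calibrate_threshold(recent_scores, quantile=0.9):
    """Hypothetical dynamic calibration: set the threshold at a
    high quantile of recently observed hallucination scores."""
    ordered = sorted(recent_scores)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def gated_reward(reasoning_reward, hallucination_score, threshold):
    """Withhold positive reinforcement when the monitor flags the
    response as likely hallucinated; negative reward passes through."""
    if hallucination_score > threshold:
        return min(reasoning_reward, 0.0)
    return reasoning_reward
```

The design intent, as described in the abstract, is that excessive reinforcement signals should not be allowed to reward visually ungrounded chains of reasoning even when the final answer happens to score well.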