🤖 AI Summary
Existing methods face dual bottlenecks when transferring LLM reasoning paradigms to multimodal large language models (MLLMs): pure reinforcement learning (RL) suffers from low sample efficiency and cannot activate reasoning capabilities the model entirely lacks, while the conventional "supervised fine-tuning (SFT) first, then RL" pipeline suppresses exploration and converges suboptimally. This paper proposes a novel two-stage paradigm, "RL-first activation, SFT-later enhancement," that bypasses cold-start SFT entirely. We introduce the first integrated "RL incentivization + SFT enhancement" framework, incorporating a Group Relative Policy Optimization (GRPO) variant, self-distilled reasoning-trajectory generation, expert-knowledge-injected prompt construction, and curriculum-driven multi-stage SFT. Evaluated on the OpenCompass multimodal reasoning benchmark, our released 7B and 72B models achieve state-of-the-art performance at their respective parameter scales; notably, the 72B model ranks fourth overall.
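The "self-distilled reasoning trajectory generation" step above can be sketched as a simple data-building loop: after RL, the model is resampled on training prompts, verified-correct trajectories become SFT targets, and prompts where every sample fails are routed to expert-knowledge injection. This is a minimal illustrative sketch, not the paper's implementation; the helper names (`sample_trajectories`, `is_correct`) and the per-prompt sample count are assumptions.

```python
# Hypothetical sketch of self-distilled SFT data construction after the RL
# stage. sample_trajectories(prompt, k) and is_correct(prompt, traj) are
# illustrative placeholders for the RL model's sampler and an answer verifier.
def build_self_distilled_sft_set(prompts, sample_trajectories, is_correct, k=8):
    """Draw k trajectories per prompt from the RL model; keep a verified-
    correct one as the SFT target. Prompts with zero correct samples are
    routed to expert-knowledge injection instead."""
    sft_data, needs_expert = [], []
    for p in prompts:
        correct = [t for t in sample_trajectories(p, k) if is_correct(p, t)]
        if correct:
            sft_data.append((p, correct[0]))  # model can reason here, just inconsistently
        else:
            needs_expert.append(p)            # fundamental capability absence
    return sft_data, needs_expert
```

The split mirrors the two failure modes the paper names: inconsistent-but-present reasoning feeds the self-distillation path, while total failure triggers expert augmentation.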
📝 Abstract
Recent advances in large language models (LLMs) have spurred the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods relying solely on reinforcement learning (RL) struggle with sample inefficiency and with activating entirely absent reasoning capabilities, while conventional pipelines that begin with a cold-start supervised fine-tuning (SFT) phase before RL can restrict the model's exploratory capacity and converge suboptimally. In this work, we introduce **Metis-RISE** (**R**L **I**ncentivizes and **S**FT **E**nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. A subsequent targeted SFT stage addresses two key challenges identified during RL: (1) *inefficient trajectory sampling* for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) *fundamental capability absence*, which we address by injecting expert-augmented knowledge for prompts where the model fails entirely. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, yielding two MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard show that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall.
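The group-relative mechanism at the heart of GRPO can be shown in a few lines: rewards for a group of responses sampled from the same prompt are normalized against the group's own mean and standard deviation, replacing a learned value baseline. This is a sketch of the standard GRPO advantage estimate only; the paper uses an unspecified GRPO variant, and the `eps` smoothing term here is an assumption.

```python
# Illustrative group-relative advantage estimation (the core idea of GRPO).
# Each of the G responses sampled for one prompt receives an advantage equal
# to its reward's z-score within the group; no critic network is needed.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its group.

    rewards: scalar rewards for G responses sampled from the same prompt.
    Responses above the group mean get positive advantages (reinforced);
    those below get negative advantages (penalized)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of one prompt, reward 1.0 for a verified-correct answer.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from sibling rollouts, a prompt where all samples fail yields uniformly zero-mean advantages and no learning signal, which is exactly the "fundamental capability absence" case the SFT stage is designed to fix.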