🤖 AI Summary
Existing methods face dual bottlenecks when transferring LLM reasoning paradigms to multimodal large language models (MLLMs): pure reinforcement learning (RL) suffers from low sample efficiency and cannot activate reasoning capabilities the model entirely lacks, while the conventional "supervised fine-tuning (SFT) first, then RL" pipeline suppresses exploration and converges suboptimally. This paper proposes a novel two-stage paradigm, "RL-first activation, SFT-later enhancement," that bypasses cold-start SFT entirely. We introduce the first integrated "RL incentivization + SFT enhancement" framework, incorporating a Group Relative Policy Optimization (GRPO) variant, self-distilled reasoning-trajectory generation, expert-knowledge-injected prompt construction, and curriculum-driven multi-stage SFT. Evaluated on the OpenCompass multimodal reasoning benchmark, our released 7B and 72B models achieve state-of-the-art performance at their respective parameter scales; notably, the 72B model ranks fourth overall.
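The "self-distilled reasoning trajectory generation" step above can be sketched as a simple data-building loop: after RL, the model is resampled on training prompts, verified-correct trajectories become SFT targets, and prompts where every sample fails are routed to expert-knowledge injection. This is a minimal illustrative sketch, not the paper's implementation; the helper names (`sample_trajectories`, `is_correct`) and the per-prompt sample count are assumptions.

```python
# Hypothetical sketch of self-distilled SFT data construction after the RL
# stage. sample_trajectories(prompt, k) and is_correct(prompt, traj) are
# illustrative placeholders for the RL model's sampler and an answer verifier.
def build_self_distilled_sft_set(prompts, sample_trajectories, is_correct, k=8):
    """Draw k trajectories per prompt from the RL model; keep a verified-
    correct one as the SFT target. Prompts with zero correct samples are
    routed to expert-knowledge injection instead."""
    sft_data, needs_expert = [], []
    for p in prompts:
        correct = [t for t in sample_trajectories(p, k) if is_correct(p, t)]
        if correct:
            sft_data.append((p, correct[0]))  # model can reason here, just inconsistently
        else:
            needs_expert.append(p)            # fundamental capability absence
    return sft_data, needs_expert
```

The split mirrors the two failure modes the paper names: inconsistent-but-present reasoning feeds the self-distillation path, while total failure triggers expert augmentation.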
📝 Abstract
Recent advances in large language models (LLMs) have spurred the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods relying solely on reinforcement learning (RL) struggle with sample inefficiency and with activating entirely absent reasoning capabilities, while conventional pipelines that begin with a cold-start supervised fine-tuning (SFT) phase before RL can restrict the model's exploratory capacity and converge suboptimally. In this work, we introduce **Metis-RISE** (**R**L **I**ncentivizes and **S**FT **E**nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. A subsequent targeted SFT stage addresses two key challenges identified during RL: (1) *inefficient trajectory sampling* for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) *fundamental capability absence*, which we address by injecting expert-augmented knowledge for prompts where the model fails entirely. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, yielding two MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard show that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall.
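The group-relative mechanism at the heart of GRPO can be shown in a few lines: rewards for a group of responses sampled from the same prompt are normalized against the group's own mean and standard deviation, replacing a learned value baseline. This is a sketch of the standard GRPO advantage estimate only; the paper uses an unspecified GRPO variant, and the `eps` smoothing term here is an assumption.

```python
# Illustrative group-relative advantage estimation (the core idea of GRPO).
# Each of the G responses sampled for one prompt receives an advantage equal
# to its reward's z-score within the group; no critic network is needed.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its group.

    rewards: scalar rewards for G responses sampled from the same prompt.
    Responses above the group mean get positive advantages (reinforced);
    those below get negative advantages (penalized)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of one prompt, reward 1.0 for a verified-correct answer.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from sibling rollouts, a prompt where all samples fail yields uniformly zero-mean advantages and no learning signal, which is exactly the "fundamental capability absence" case the SFT stage is designed to fix.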