SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) remain limited on complex reasoning tasks that require self-reflection and iterative correction: existing reflection methods are simplistic and inefficient, and a model's knowledge and reasoning abilities are largely fixed after pre-training. Method: SRPO is a two-stage reflection-aware reinforcement learning framework that (1) synthesizes high-quality reflective training data and (2) applies Group Relative Policy Optimization (GRPO) with a reward mechanism that favors concise, cognitively meaningful reflection, so that reasoning and reflection capabilities evolve jointly. Contribution/Results: SRPO pushes past the fixed knowledge boundaries of pre-trained models by optimizing reflection end-to-end throughout policy optimization. Evaluated on MathVista, MathVision, MathVerse, and MMMU-Pro, SRPO built on Qwen-2.5-VL significantly outperforms state-of-the-art methods, improving both reasoning accuracy and reflection quality and demonstrating the efficacy and scalability of reflection-driven multimodal reasoning.
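For readers unfamiliar with GRPO, the sketch below illustrates its core idea of group-relative advantage estimation: several responses are sampled per prompt, and each response's reward is normalized against its own group's statistics, removing the need for a learned value critic. This is a minimal illustration of the general GRPO idea, not the paper's implementation; the function name and the 0/1 reward values are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sampled response's reward
    by the mean and std of its own group, so no value critic is needed.
    (Illustrative sketch; not the paper's code.)"""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 responses sampled for one prompt, rewarded 1 if correct else 0.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct responses get positive advantage, incorrect negative
```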

📝 Abstract
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
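The abstract describes a second-stage reward that favors concise, cognitively meaningful reflection while avoiding redundancy. As a rough illustration of how such a shaping term could combine with a correctness reward, here is a hypothetical sketch; the function name, token threshold, and bonus weight are assumptions, not the paper's actual reward.

```python
def reflection_aware_reward(is_correct: bool, reflection: str,
                            max_tokens: int = 120, bonus: float = 0.2) -> float:
    """Hypothetical reward shaping: base accuracy reward plus a small
    bonus when a reflection is present and concise, and a penalty when
    it is redundantly long. All thresholds are illustrative only."""
    reward = 1.0 if is_correct else 0.0
    n_tokens = len(reflection.split())  # crude token count for the sketch
    if 0 < n_tokens <= max_tokens:
        reward += bonus   # concise, non-empty reflection
    elif n_tokens > max_tokens:
        reward -= bonus   # verbose, redundant reflection
    return reward

# Example: a correct answer with a short reflection scores highest.
print(reflection_aware_reward(True, "I initially misread the axis; re-checking fixed it."))
```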
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLM reasoning with self-reflection and self-correction
Overcoming simplistic reflection methods in multimodal LLMs
Improving reasoning accuracy and reflection quality via RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage reflection-aware RL framework
High-quality reflection-focused dataset construction
Novel reward mechanism for meaningful reflection
👥 Authors
Zhongwei Wan (The Ohio State University, PhD student) · LLM, Multimodal, NLP
Zhihao Dou (Case Western Reserve University)
Che Liu (Imperial College London) · Multimodal learning, AI4Medicine
Yu Zhang (Tongji University)
Dongfei Cui (Duke University)
Qinjian Zhao (Kean University)
Hui Shen (ByteDance)
Jing Xiong (The University of Hong Kong)
Yi Xin (California Institute of Technology) · Industrial Organization, Econometrics
Yifan Jiang (University of Southern California)
Yangfan He (University of Minnesota - Twin Cities) · AI Agent, Reasoning, AI Alignment, Foundation Models
Mi Zhang (The Ohio State University)
Shen Yan (ByteDance)