SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) remain limited on complex reasoning tasks that require self-reflection and iterative correction: existing reflection methods are simplistic and inefficient, and a model's knowledge and reasoning abilities are largely fixed after pre-training. Method: SRPO is a two-stage reflection-aware reinforcement learning framework that (1) synthesizes high-quality reflective training data and (2) applies Group Relative Policy Optimization (GRPO) with a reward mechanism that favors concise, cognitively meaningful reflection, so that reasoning and reflection capabilities evolve jointly. Contribution/Results: SRPO pushes past the fixed knowledge boundaries of pre-trained models by optimizing reflection end-to-end throughout policy optimization. Evaluated on MathVista, MathVision, MathVerse, and MMMU-Pro, SRPO built on Qwen-2.5-VL significantly outperforms state-of-the-art methods, improving both reasoning accuracy and reflection quality and demonstrating the efficacy and scalability of reflection-driven multimodal reasoning.
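For readers unfamiliar with GRPO, the sketch below illustrates its core idea of group-relative advantage estimation: several responses are sampled per prompt, and each response's reward is normalized against its own group's statistics, removing the need for a learned value critic. This is a minimal illustration of the general GRPO idea, not the paper's implementation; the function name and the 0/1 reward values are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sampled response's reward
    by the mean and std of its own group, so no value critic is needed.
    (Illustrative sketch; not the paper's code.)"""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 responses sampled for one prompt, rewarded 1 if correct else 0.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct responses get positive advantage, incorrect negative
```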

📝 Abstract
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
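The abstract describes a second-stage reward that favors concise, cognitively meaningful reflection while avoiding redundancy. As a rough illustration of how such a shaping term could combine with a correctness reward, here is a hypothetical sketch; the function name, token threshold, and bonus weight are assumptions, not the paper's actual reward.

```python
def reflection_aware_reward(is_correct: bool, reflection: str,
                            max_tokens: int = 120, bonus: float = 0.2) -> float:
    """Hypothetical reward shaping: base accuracy reward plus a small
    bonus when a reflection is present and concise, and a penalty when
    it is redundantly long. All thresholds are illustrative only."""
    reward = 1.0 if is_correct else 0.0
    n_tokens = len(reflection.split())  # crude token count for the sketch
    if 0 < n_tokens <= max_tokens:
        reward += bonus   # concise, non-empty reflection
    elif n_tokens > max_tokens:
        reward -= bonus   # verbose, redundant reflection
    return reward

# Example: a correct answer with a short reflection scores highest.
print(reflection_aware_reward(True, "I initially misread the axis; re-checking fixed it."))
```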
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLM reasoning with self-reflection and self-correction
Overcoming simplistic reflection methods in multimodal LLMs
Improving reasoning accuracy and reflection quality via RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage reflection-aware RL framework
High-quality reflection-focused dataset construction
Novel reward mechanism for meaningful reflection
👥 Authors
Zhongwei Wan (The Ohio State University, PhD student) · LLM, Multimodal, NLP
Zhihao Dou (Case Western Reserve University)
Che Liu (Imperial College London) · Multimodal learning, AI4Medicine
Yu Zhang (Tongji University)
Dongfei Cui (Duke University)
Qinjian Zhao (Kean University)
Hui Shen (ByteDance)
Jing Xiong (The University of Hong Kong)
Yi Xin (California Institute of Technology) · Industrial Organization, Econometrics
Yifan Jiang (University of Southern California)
Yangfan He (University of Minnesota - Twin Cities) · AI Agent, Reasoning, AI Alignment, Foundation Models
Mi Zhang (The Ohio State University)
Shen Yan (ByteDance)