SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing outcome-reward-based reinforcement learning (RL) methods for multimodal large language models (MLLMs) lack explicit supervision over the reasoning process, leading to suboptimal policies and limited generalization. This work addresses the gap with a *trustworthy reasoning reward* framework, introducing the first *Thinking Reward Model* (TRM) and a *credibility-weighted optimization algorithm*, Trust-GRPO, alongside a *reward annealing strategy* that dynamically balances reasoning quality against the final outcome during training. Within a rule-guided RL paradigm, the approach explicitly models and optimizes the reasoning path. Evaluated on benchmarks including MathVista and MMMU, the SophiaVL-R1-7B model significantly outperforms LLaVA-OneVision-72B, with substantial gains in reasoning generalization. All code and models are publicly released.
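
The three components interact through a single training signal. A minimal sketch of how they might combine into a per-response reward (this is a hypothetical formulation, not the authors' code; `r_outcome`, `r_thinking`, `trust`, and `anneal` are assumed names):

```python
def combined_reward(r_outcome: float, r_thinking: float,
                    trust: float, anneal: float) -> float:
    """Blend the rule-based outcome reward with a trust-weighted,
    annealed thinking reward (hypothetical formulation)."""
    return r_outcome + anneal * trust * r_thinking
```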

📝 Abstract
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1 as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers with those leading to incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
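
The abstract describes the trustworthiness weight as a contrast between the thinking rewards of correct and incorrect responses. One plausible reading, at the level of a GRPO sampling group, is sketched below; the function name, fallback value, and clamping are assumptions, not the paper's exact formula:

```python
from statistics import mean

def trust_weight(thinking_rewards: list[float],
                 is_correct: list[bool]) -> float:
    """Estimate how trustworthy the thinking reward is for one group of
    sampled responses: a larger gap between the mean thinking reward of
    correct and incorrect responses suggests a more reliable signal.
    One plausible reading, not the paper's exact formula."""
    correct = [r for r, c in zip(thinking_rewards, is_correct) if c]
    incorrect = [r for r, c in zip(thinking_rewards, is_correct) if not c]
    if not correct or not incorrect:
        return 1.0  # no contrast available in this group; default trust
    gap = mean(correct) - mean(incorrect)
    return min(1.0, max(0.0, 0.5 + gap))  # clamp to [0, 1]
```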
Problem

Research questions and friction points this paper is trying to address.

Lack of supervision over MLLMs' reasoning process
Sub-optimal reasoning strategies hinder generalization
Unreliable thinking rewards due to reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a thinking reward model to score reasoning quality
Proposes Trust-GRPO to mitigate unreliable thinking rewards
Uses reward annealing to shift reliance toward outcome rewards (see the sketch below)
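
A minimal sketch of the annealing idea from the third bullet, assuming a linear decay of the thinking-reward coefficient (the paper's actual schedule may differ):

```python
def anneal_coefficient(step: int, total_steps: int) -> float:
    """Linearly decay the thinking-reward coefficient from 1.0 to 0.0,
    so later training leans on the accurate rule-based outcome reward.
    Illustrative schedule; the paper's decay may differ."""
    return max(0.0, 1.0 - step / total_steps)
```

For example, at step 800 of 1000 the coefficient is 0.2, so the thinking reward contributes only a fifth of its initial weight while the outcome reward dominates.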