🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from two key limitations in reasoning: (1) reliance on outcome-only supervision, which neglects the validity and coherence of intermediate reasoning steps; and (2) fixed reasoning strategies, which lead to overthinking on simple tasks and underthinking on complex ones. To address these, the authors propose an adaptive reasoning framework grounded in reinforcement learning, featuring a dual-reward mechanism: a *Thinking Reward*, which models factual grounding, logical coherence, and answer consistency, and a *Judging Reward*, which dynamically determines whether deep reasoning is warranted. This enables joint optimization of reasoning quality and strategy, and the adaptive strategy selection also suppresses hallucination and improves robustness in complex multimodal reasoning. Evaluated at 4B and 8B scales, the approach significantly outperforms existing open-source MLLMs and is competitive with GPT-4o, demonstrating strong effectiveness, generalizability, and scalability.
📝 Abstract
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves performance on reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieves competitive results against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations. These results establish SAIL-RL as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
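The dual reward system described above can be sketched as a weighted combination of an outcome signal with a Thinking Reward and a Judging Reward. This is a minimal illustrative sketch, not the paper's implementation: the weights, sub-score inputs, and function names are all assumptions.

```python
# Hedged sketch of a dual-reward combination in the spirit of SAIL-RL.
# All weights, sub-scores, and names here are illustrative assumptions.

def thinking_reward(factual: float, coherent: float, consistent: float) -> float:
    """Score reasoning quality from three assumed sub-signals in [0, 1]:
    factual grounding, logical coherence, and answer consistency."""
    return (factual + coherent + consistent) / 3.0

def judging_reward(chose_deep: bool, needs_deep: bool) -> float:
    """Reward choosing deep reasoning exactly when the task warrants it,
    penalizing both overthinking and underthinking."""
    return 1.0 if chose_deep == needs_deep else 0.0

def total_reward(correct: bool,
                 factual: float, coherent: float, consistent: float,
                 chose_deep: bool, needs_deep: bool,
                 w_think: float = 0.5, w_judge: float = 0.25) -> float:
    """Combine outcome, thinking, and judging signals into one RL scalar
    (weights are assumed, not taken from the paper)."""
    outcome = 1.0 if correct else 0.0
    w_outcome = 1.0 - w_think - w_judge
    return (w_outcome * outcome
            + w_think * thinking_reward(factual, coherent, consistent)
            + w_judge * judging_reward(chose_deep, needs_deep))
```

Under this sketch, a correct answer with sound reasoning and the right strategy scores 1.0, while a correct answer reached by an ill-suited strategy or incoherent reasoning is discounted, which is the intuition behind supervising how and when to think rather than the outcome alone.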