Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

📅 2025-12-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Multimodal large language models (MLLMs) often suffer from incoherent reasoning steps and weak visual grounding, primarily because existing alignment methods supervise only final answers while neglecting the reliability of intermediate reasoning processes. Method: The authors propose SR-MCR, a label-free self-rewarding framework that they describe as the first unsupervised process alignment mechanism grounded in intrinsic output signals. It features a five-dimensional self-referential reliability model integrating semantic alignment, lexical fidelity, non-redundancy, visual grounding, and inter-step consistency. They further design a normalized reliability-weighted reward and confidence-aware temperature scaling for critic-free GRPO optimization. Results: Built on the Qwen2.5-VL architecture, SR-MCR-7B achieves 81.4% average accuracy across multiple visual reasoning benchmarks, outperforming open-source models of comparable size while improving both reasoning coherence and answer accuracy.

📝 Abstract

Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
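The abstract describes the reward only in words, so the following is a minimal sketch of what a "normalized, reliability-weighted" combination of the five cues could look like, not the paper's actual formula. It assumes each cue has already been scored in [0, 1] by some upstream scorer, and that each cue has a non-negative reliability weight; the names `CUES`, `combine_cues`, `scores`, and `reliability` are hypothetical.

```python
from typing import Dict

# The five cue names follow the abstract; the identifiers are illustrative.
CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def combine_cues(scores: Dict[str, float], reliability: Dict[str, float]) -> float:
    """Return a weighted average of the five cue scores.

    ASSUMPTION: weights are normalized to sum to 1 and scores are clipped
    to [0, 1], so the reward also lies in [0, 1]. The paper may normalize
    differently.
    """
    total = sum(reliability[c] for c in CUES)
    if total <= 0:
        raise ValueError("reliability weights must sum to a positive value")
    reward = 0.0
    for c in CUES:
        clipped = min(1.0, max(0.0, scores[c]))  # keep each cue in [0, 1]
        reward += (reliability[c] / total) * clipped
    return reward
```

With equal weights this reduces to a plain mean of the cues; raising one cue's reliability weight shifts the reward toward that cue without changing its [0, 1] range.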

Problem

Research questions and friction points this paper is trying to address.

Multimodal reasoning lacks step-to-step coherence and reliability

Reasoning processes are insufficiently grounded in the visual input

Training is unstable and prone to trivial or overly confident generations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-rewarded alignment using intrinsic process signals

Five self-referential cues integrated into reliability-weighted reward

Critic-free GRPO objective with confidence-aware cooling mechanism
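As a rough illustration of the last point, the sketch below shows the critic-free part of GRPO (advantages computed by standardizing rewards within a group of sampled responses, with no learned value model) together with one possible confidence-aware cooling step. The cooling rule here, an exponential discount on the reward as model confidence rises, is an assumption made for illustration; the source does not state the actual mechanism. The names `cool_reward`, `group_advantages`, `confidence`, and `tau` are hypothetical.

```python
import math
from typing import List


def cool_reward(reward: float, confidence: float, tau: float = 1.0) -> float:
    """Discount a reward when the model is highly confident.

    ASSUMPTION: confidence is in [0, 1] (for example, a mean token
    probability) and the discount is exp(-confidence / tau). This is an
    illustrative stand-in, not the paper's cooling rule.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    return reward * math.exp(-confidence / tau)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    This is the group-relative baseline that lets GRPO skip a critic:
    each response is compared with the mean of its own group.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: cool each sampled response's reward, then compare within the group.
raw = [0.9, 0.6, 0.3]
conf = [0.95, 0.50, 0.20]
advantages = group_advantages([cool_reward(r, c) for r, c in zip(raw, conf)])
```

Under this assumed rule, a response with a high raw reward but very high confidence receives a smaller advantage than it would without cooling, which is one way to discourage overly confident generations.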

🔎 Similar Papers

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts