Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) often suffer from incoherent reasoning steps and weak visual grounding, primarily because existing alignment methods supervise only final answers while neglecting the reliability of intermediate reasoning processes. Method: We propose SR-MCR, a label-free self-rewarding framework that introduces the first unsupervised process alignment mechanism grounded in intrinsic output signals. It features a five-dimensional self-referential reliability model integrating semantic alignment, lexical fidelity, non-redundancy, visual grounding, and inter-step consistency. We further design a normalized reliability-weighted reward and confidence-aware temperature scaling for critic-free GRPO optimization. Results: Evaluated on the Qwen2.5-VL architecture, SR-MCR-7B achieves 81.4% average accuracy across multiple visual reasoning benchmarks—outperforming comparable open-source models and simultaneously improving both reasoning coherence and answer accuracy.

📝 Abstract
Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
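The abstract mentions a critic-free GRPO objective. GRPO's defining trait is that advantages are computed group-relatively, by normalizing each sampled response's reward against the group's mean and standard deviation, so no learned value model (critic) is required. A minimal sketch of that step, independent of SR-MCR's specific reward:

```python
import statistics

def grpo_advantages(rewards):
    """Critic-free group-relative advantages in the style of GRPO:
    each response in a sampled group is scored against the group's
    mean and standard deviation, so no learned value model is needed.
    (Illustrative sketch; SR-MCR's exact objective is not shown here.)"""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]
```

For a sampled group with rewards `[1.0, 2.0, 3.0]`, the advantages are zero-mean, with the above-average response pushed up and the below-average one pushed down.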
Problem

Research questions and friction points this paper is trying to address.

Improves multimodal reasoning coherence and reliability
Enhances visual grounding in reasoning processes
Stabilizes training to suppress trivial generations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-rewarded alignment using intrinsic process signals
Five self-referential cues integrated into reliability-weighted reward
Critic-free GRPO objective with confidence-aware cooling mechanism
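The page does not publish SR-MCR's formulas, but the ideas above can be sketched under stated assumptions: a normalized reliability-weighted reward as a weighted mean over the five cue scores, and confidence-aware cooling as a linear confidence-to-temperature schedule. All function names, weights, and the `alpha` parameter below are hypothetical.

```python
def reliability_reward(cues, weights=None):
    """Hypothetical normalized reliability-weighted reward.

    `cues` maps each self-referential signal (semantic alignment,
    lexical fidelity, non-redundancy, visual grounding, step
    consistency) to a score in [0, 1]. Assumes a weighted mean;
    SR-MCR's exact formulation is not given on this page.
    """
    weights = weights or {k: 1.0 for k in cues}
    total_w = sum(weights[k] for k in cues)
    # Weighted mean of cue scores, normalized back to [0, 1].
    return sum(weights[k] * cues[k] for k in cues) / total_w

def cooled_temperature(confidence, base_temp=1.0, alpha=0.5):
    """Assumed form of confidence-aware cooling: raise the sampling
    temperature for overly confident generations to discourage
    trivial outputs. `alpha` controls how strongly confidence
    (in [0, 1]) inflates the temperature."""
    return base_temp * (1.0 + alpha * confidence)
```

With equal weights, a response scoring 1.0 on every cue receives the maximal reward of 1.0, while a high-confidence generation is sampled at a higher (cooler-gradient) temperature than a low-confidence one.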
Jesen Zhang
Sun Yat-sen University
Ningyuan Liu
Sun Yat-sen University
Kaitong Cai
Sun Yat-sen University
Sidi Liu
Sun Yat-sen University
Jing Yang
Sun Yat-sen University
Ziliang Chen
AP, Pengcheng Lab
Machine Learning · Foundation Models · Multimodal Embodied Intelligence
Xiaofei Sun
Stony Brook University, Zhejiang University
Social and Information Networks · Natural Language Processing · Machine Learning
Keze Wang
Sun Yat-sen University