🤖 AI Summary
Multimodal large language models (MLLMs) often suffer from incoherent reasoning steps and weak visual grounding, primarily because existing alignment methods supervise only final answers while neglecting the reliability of intermediate reasoning processes.
Method: We propose SR-MCR, a label-free self-rewarding framework that aligns the reasoning process without supervision, using intrinsic signals derived from the model's own outputs. It combines five self-referential reliability cues: semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency. These cues are integrated into a normalized, reliability-weighted reward, which is optimized with a critic-free GRPO objective stabilized by a confidence-aware cooling mechanism.
Results: Built on Qwen2.5-VL, SR-MCR-7B achieves 81.4% average accuracy across a broad set of visual benchmarks, the best result among open-source models of comparable size, while improving both reasoning coherence and answer accuracy.
📝 Abstract
Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
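The reward and objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that each of the five cues is already scored in [0, 1], that "normalized, reliability-weighted" means a weighted average with weights rescaled to sum to one, and that the critic-free step uses the standard GRPO group-relative advantage (each reward standardized against the other samples drawn for the same prompt). The names `CUES`, `process_reward`, and `group_advantages` are hypothetical, and the confidence-aware cooling mechanism is omitted because its form is not specified here.

```python
from statistics import mean, pstdev

# Hypothetical identifiers for the five cues named in the abstract.
CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def process_reward(scores, weights):
    """Weighted average of the cue scores (assumption: each score is in [0, 1]).

    Dividing by the weight total keeps the reward in [0, 1] regardless of
    how the raw weights are scaled.
    """
    total = sum(weights[c] for c in CUES)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * scores[c] for c in CUES) / total


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    No learned critic is needed; the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: two sampled reasoning traces for the same prompt (made-up scores).
weights = {c: 1.0 for c in CUES}
group = [
    {c: 0.9 for c in CUES},  # coherent, well-grounded trace
    {c: 0.3 for c in CUES},  # redundant, weakly grounded trace
]
rewards = [process_reward(s, weights) for s in group]
advantages = group_advantages(rewards)
```

In this sketch the better-grounded trace receives a positive advantage and the weaker one a negative advantage, which is the signal a policy-gradient update would then reinforce. If every trace in a group scores the same, all advantages are zero and the group contributes no gradient.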