🤖 AI Summary
Multimodal large language models (MLLMs) trained via reinforcement learning receive only final-answer rewards, which can leave the reasoning that produces those answers unfaithful. To address this, the paper proposes AutoRubric-R1V, a framework that explicitly supervises the reasoning process through automatically generated, question-specific rubrics. Its core innovation is a scalable self-aggregation method that extracts consistent checkpoints from successful reasoning trajectories, without human annotation or a stronger teacher model, and turns them into faithful process-level supervision signals. On top of this, it introduces a dual-signal reward mechanism that integrates generative process rewards with outcome rewards. Evaluated on six mainstream multimodal reasoning benchmarks, AutoRubric-R1V achieves state-of-the-art performance and shows significant gains over existing methods on dedicated reasoning-faithfulness evaluations.
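To make the self-aggregation idea concrete, here is a minimal sketch of distilling a rubric from correct rollouts. It is an illustration only: the checkpoint-string representation, the `aggregate_rubric` helper, and the `min_support` threshold are hypothetical stand-ins for the paper's actual consistency-based distillation step.

```python
# Minimal sketch of self-aggregating rubric construction, assuming each
# successful trajectory has already been reduced to a list of short
# checkpoint strings; the representation and min_support threshold are
# hypothetical, not taken from the paper.
from collections import Counter


def aggregate_rubric(successful_trajectories: list[list[str]],
                     min_support: float = 0.6) -> list[str]:
    """Keep checkpoints that recur across enough correct trajectories,
    a stand-in for the paper's consistency-based distillation step."""
    n = len(successful_trajectories)
    counts = Counter(cp for traj in successful_trajectories for cp in set(traj))
    return [cp for cp, c in counts.items() if c / n >= min_support]


# Example: checkpoints shared by most correct rollouts survive.
trajs = [
    ["read the legend", "compare bar heights"],
    ["read the legend", "compare bar heights", "check units"],
    ["compare bar heights"],
]
print(aggregate_rubric(trajs))  # ['read the legend', 'compare bar heights']
```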
📝 Abstract
Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
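As an illustration of jointly leveraging rubric-based and outcome rewards, below is a minimal sketch assuming a simple weighted combination. The `Rubric` type, the `dual_signal_reward` helper, and the `alpha` weight are hypothetical, and the substring-matching "judge" is a placeholder: the paper scores rubrics with a generative reward model, not string matching.

```python
# Minimal sketch of a dual-signal reward: a weighted mix of a
# rubric-based process reward and a verifiable outcome reward.
# The alpha weight and the substring-matching "judge" are hypothetical;
# the paper uses a generative reward model to score the rubric.
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Question-specific checkpoints distilled from successful trajectories."""
    checkpoints: list[str] = field(default_factory=list)


def rubric_reward(trajectory: str, rubric: Rubric) -> float:
    """Fraction of rubric checkpoints satisfied by the reasoning trace."""
    if not rubric.checkpoints:
        return 0.0
    hits = sum(cp in trajectory for cp in rubric.checkpoints)
    return hits / len(rubric.checkpoints)


def outcome_reward(predicted: str, gold: str) -> float:
    """Verifiable final-answer reward, as in standard RLVR."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def dual_signal_reward(trajectory: str, predicted: str, gold: str,
                       rubric: Rubric, alpha: float = 0.5) -> float:
    """Jointly leverage process-level and outcome rewards."""
    return (alpha * rubric_reward(trajectory, rubric)
            + (1 - alpha) * outcome_reward(predicted, gold))


# Example usage
rubric = Rubric(["read the legend", "compare bar heights"])
trace = "First read the legend, then compare bar heights to answer."
print(dual_signal_reward(trace, predicted="42", gold="42", rubric=rubric))  # 1.0
```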