AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address reasoning unfaithfulness in multimodal large language models (MLLMs) trained via reinforcement learning—where only final-answer rewards are provided—this paper proposes AutoRubric-R1V, a framework that enables explicit supervision of the reasoning process through automatically generated, question-specific rubrics. Its core innovation is a scalable, self-aggregating method that extracts consistent checkpoints from successful reasoning trajectories without requiring human annotations or strong teacher models, thereby constructing faithful process-level supervision signals. Furthermore, it introduces a dual-signal reward mechanism integrating generative process rewards with outcome rewards. Evaluated on six mainstream multimodal reasoning benchmarks, AutoRubric-R1V achieves state-of-the-art performance and demonstrates significant improvements over existing methods on dedicated reasoning faithfulness evaluations.

📝 Abstract
Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
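The dual-signal reward described above can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: in AutoRubric-R1V the rubric checkpoints are judged generatively by a model, whereas here simple substring matching stands in for that judge, and the mixing weight `lam` is an assumed parameter.

```python
def rubric_reward(trajectory: str, rubric: list[str]) -> float:
    """Fraction of rubric checkpoints satisfied by the reasoning trace.

    Substring matching is a stand-in for the paper's generative
    (model-based) checkpoint verification.
    """
    if not rubric:
        return 0.0
    hits = sum(1 for checkpoint in rubric if checkpoint in trajectory)
    return hits / len(rubric)


def dual_signal_reward(trajectory: str, rubric: list[str],
                       answer: str, gold: str, lam: float = 0.5) -> float:
    """Combine outcome correctness with process-level rubric credit.

    `lam` (the process-reward weight) is an illustrative assumption.
    """
    outcome = 1.0 if answer == gold else 0.0
    return outcome + lam * rubric_reward(trajectory, rubric)
```

Under this sketch, a trajectory that reaches the right answer through the expected checkpoints earns more than one that guesses correctly, which is the intuition behind rewarding the process as well as the outcome.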
Problem

Research questions and friction points this paper is trying to address.

Addresses spurious reasoning in multimodal language models
Generates rubric-based rewards without human annotation
Improves reasoning faithfulness through process supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses rubric-based generative rewards for supervision
Self-aggregates reasoning checkpoints from trajectories
Combines rubric and outcome rewards for performance
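The self-aggregation idea in the bullets above can be sketched as a frequency filter over successful trajectories: steps that recur consistently across correct solutions become rubric checkpoints. The step representation and the `min_frac` threshold are illustrative assumptions; the paper distills checkpoints with a model rather than exact string counting.

```python
from collections import Counter


def build_rubric(successful_trajectories: list[list[str]],
                 min_frac: float = 0.6) -> list[str]:
    """Keep reasoning steps that appear in at least min_frac of successes.

    `min_frac` is a hypothetical consistency threshold, not a value
    from the paper.
    """
    counts: Counter[str] = Counter()
    for steps in successful_trajectories:
        counts.update(set(steps))  # count each step at most once per trajectory
    n = len(successful_trajectories)
    return [step for step, c in counts.items() if c / n >= min_frac]
```

Because the rubric is distilled from the model's own successful rollouts, no human annotation or stronger teacher model is needed, matching the scalability claim.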
👥 Authors
Mengzhao Jia
University of Notre Dame
Zhihan Zhang
PhD student, University of Notre Dame
Natural Language Processing
Ignacio Cases
Uniphore
Zheyuan Liu
University of Notre Dame
Meng Jiang
University of Notre Dame
Peng Qi
Uniphore