🤖 AI Summary
Multimodal large language models (MLLMs) trained via reinforcement learning receive only final-answer rewards, which can leave the reasoning that produces those answers unfaithful. To address this, the paper proposes AutoRubric-R1V, a framework that explicitly supervises the reasoning process through automatically generated, question-specific rubrics. Its core innovation is a scalable self-aggregation method that extracts consistent checkpoints from successful reasoning trajectories, without human annotation or a stronger teacher model, and turns them into faithful process-level supervision signals. On top of this, it introduces a dual-signal reward mechanism that integrates generative process rewards with outcome rewards. Evaluated on six mainstream multimodal reasoning benchmarks, AutoRubric-R1V achieves state-of-the-art performance and shows significant gains over existing methods on dedicated reasoning-faithfulness evaluations.
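To make the self-aggregation idea concrete, here is a minimal sketch of distilling a rubric from correct rollouts. It is an illustration only: the checkpoint-string representation, the `aggregate_rubric` helper, and the `min_support` threshold are hypothetical stand-ins for the paper's actual consistency-based distillation step.

```python
# Minimal sketch of self-aggregating rubric construction, assuming each
# successful trajectory has already been reduced to a list of short
# checkpoint strings; the representation and min_support threshold are
# hypothetical, not taken from the paper.
from collections import Counter


def aggregate_rubric(successful_trajectories: list[list[str]],
                     min_support: float = 0.6) -> list[str]:
    """Keep checkpoints that recur across enough correct trajectories,
    a stand-in for the paper's consistency-based distillation step."""
    n = len(successful_trajectories)
    counts = Counter(cp for traj in successful_trajectories for cp in set(traj))
    return [cp for cp, c in counts.items() if c / n >= min_support]


# Example: checkpoints shared by most correct rollouts survive.
trajs = [
    ["read the legend", "compare bar heights"],
    ["read the legend", "compare bar heights", "check units"],
    ["compare bar heights"],
]
print(aggregate_rubric(trajs))  # ['read the legend', 'compare bar heights']
```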
📝 Abstract
Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
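As an illustration of jointly leveraging rubric-based and outcome rewards, below is a minimal sketch assuming a simple weighted combination. The `Rubric` type, the `dual_signal_reward` helper, and the `alpha` weight are hypothetical, and the substring-matching "judge" is a placeholder: the paper scores rubrics with a generative reward model, not string matching.

```python
# Minimal sketch of a dual-signal reward: a weighted mix of a
# rubric-based process reward and a verifiable outcome reward.
# The alpha weight and the substring-matching "judge" are hypothetical;
# the paper uses a generative reward model to score the rubric.
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Question-specific checkpoints distilled from successful trajectories."""
    checkpoints: list[str] = field(default_factory=list)


def rubric_reward(trajectory: str, rubric: Rubric) -> float:
    """Fraction of rubric checkpoints satisfied by the reasoning trace."""
    if not rubric.checkpoints:
        return 0.0
    hits = sum(cp in trajectory for cp in rubric.checkpoints)
    return hits / len(rubric.checkpoints)


def outcome_reward(predicted: str, gold: str) -> float:
    """Verifiable final-answer reward, as in standard RLVR."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def dual_signal_reward(trajectory: str, predicted: str, gold: str,
                       rubric: Rubric, alpha: float = 0.5) -> float:
    """Jointly leverage process-level and outcome rewards."""
    return (alpha * rubric_reward(trajectory, rubric)
            + (1 - alpha) * outcome_reward(predicted, gold))


# Example usage
rubric = Rubric(["read the legend", "compare bar heights"])
trace = "First read the legend, then compare bar heights to answer."
print(dual_signal_reward(trace, predicted="42", gold="42", rubric=rubric))  # 1.0
```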