RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the susceptibility of multimodal large language models (MLLMs) to reward hacking in reinforcement learning, a problem exacerbated by reliance on outcome-based rewards alone and by the high computational cost of existing scoring-based approaches, which hinders training efficiency. To mitigate this, the authors propose RuCL, a hierarchical scoring-rubric curriculum learning framework that shifts the focus of curriculum design from data selection to reward shaping. RuCL generates generalizable scoring rubrics and dynamically adjusts their hierarchy and weighting based on the model's evolving capabilities, guiding learning from basic perceptual understanding to complex logical reasoning. This approach reduces reward hacking while improving training efficiency, achieving an average performance gain of 7.83% across multiple visual reasoning benchmarks and a state-of-the-art accuracy of 60.06% with Qwen2.5-VL-7B.

πŸ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
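The abstract's core mechanism, stratifying rubrics and dynamically re-weighting them as model competence grows, can be illustrated with a small sketch. The schedule below is an assumption for illustration only (the paper does not publish its exact weighting formula here); the stratum names, the Gaussian-bump schedule, and both function names are hypothetical.

```python
import math

def rubric_weights(competence, strata=("perception", "grounding", "reasoning"), sharpness=4.0):
    """Shift reward weight toward later (harder) strata as competence in [0, 1] grows.

    Assumed form: stratum k gets a difficulty center k/(n-1); its weight peaks
    when competence is near that center, giving a soft perception-to-reasoning
    curriculum rather than a hard stage switch.
    """
    n = len(strata)
    raw = [math.exp(-sharpness * (competence - k / (n - 1)) ** 2) for k in range(n)]
    total = sum(raw)
    return {s: w / total for s, w in zip(strata, raw)}

def shaped_reward(rubric_scores, competence):
    """Combine per-stratum rubric scores (each in [0, 1]) into one scalar reward."""
    weights = rubric_weights(competence, strata=tuple(rubric_scores))
    return sum(weights[s] * rubric_scores[s] for s in rubric_scores)

# Early in training, perception rubrics dominate the reward; late in
# training, reasoning rubrics do.
early = rubric_weights(0.0)
late = rubric_weights(1.0)
```

With this schedule, a model that only satisfies perceptual rubrics is rewarded early on but sees that reward shrink as its competence estimate rises, which is one way to realize the "foundational perception to advanced logical reasoning" progression the abstract describes.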
Problem

Research questions and friction points this paper is trying to address.

reward hacking
rubric-based supervision
multimodal large language models
curriculum learning
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum Learning
Rubric-based Supervision
Multimodal Reasoning
Reward Design
Stratified Training