Reinforcement Learning with Robust Rubric Rewards

📅 2026-05-28
🤖 AI Summary
This work addresses the challenge of providing effective reinforcement learning rewards for vision-language tasks, which are often only partially verifiable. The authors propose a criterion-level verification framework that refines reward modeling from the task level down to individual rubric criteria. The approach pairs an LLM extractor with a deterministic verifier for checkable criteria, uses an LLM judge for non-verifiable criteria, and applies a minimal exposure strategy that hides ground truths from extractors and images from judges. A hierarchical reward aggregation mechanism prioritizes essential criteria over additional ones, and score saturation within rollout groups is suppressed. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, the method consistently outperforms RLVR and improves on the base model by an average of 4.7 points, a gain that exceeds the gap between the official instruct and thinking models.
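The saturation idea mentioned in the summary can be illustrated with a small sketch. The page does not describe the exact mechanism, so everything here is an assumption made for illustration: the function name `suppress_saturated`, the `threshold` parameter, and the rule itself (drop any criterion that every rollout in a group already passes, since it cannot separate good responses from bad ones) are hypothetical and should not be read as the authors' method.

```python
from typing import List


def suppress_saturated(scores: List[List[float]], threshold: float = 1.0) -> List[float]:
    """Return one reward per rollout, ignoring saturated criteria.

    scores[i][j] is the score of rollout i on criterion j (0.0 or 1.0).
    A criterion is treated as saturated when its pass rate across the
    rollout group reaches `threshold` (hypothetical rule, not from the paper).
    """
    n_rollouts = len(scores)
    n_criteria = len(scores[0])

    # Pass rate of each criterion across the whole rollout group.
    pass_rate = [sum(row[j] for row in scores) / n_rollouts for j in range(n_criteria)]

    # Keep only criteria that still discriminate between rollouts.
    keep = [j for j in range(n_criteria) if pass_rate[j] < threshold]

    # If every criterion is saturated, fall back to using all of them.
    if not keep:
        keep = list(range(n_criteria))

    # Reward is the mean score over the retained criteria.
    return [sum(row[j] for j in keep) / len(keep) for row in scores]


# Three rollouts scored on three criteria; criterion 0 is passed by everyone.
group = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]
rewards = suppress_saturated(group)
```

In this toy group the first criterion is dropped, so the rewards are driven only by the two criteria on which the rollouts actually differ.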
📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are only partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on how accurately they are executed during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the gap between the official instruct and thinking models. Controlled audits confirm that our deterministic verification and minimal exposure significantly reduce exploitable false positives.
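A minimal sketch of the criterion-level routing and hierarchical aggregation described in the abstract is given below. It is an illustration under stated assumptions, not the authors' implementation: the `Criterion` class, the function names, the exact-match verifier, the `bonus_weight` parameter, and the gating rule (additional criteria count only once every essential criterion passes) are all hypothetical. The extractor and judge are stand-in functions where a real system would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Criterion:
    description: str
    essential: bool                     # essential vs. additional criterion
    ground_truth: Optional[str] = None  # set only for deterministically checkable criteria


def score_criterion(c: Criterion, response: str,
                    extract: Callable[[str, str], str],
                    judge: Callable[[str, str], bool]) -> float:
    if c.ground_truth is not None:
        # Path 1: the extractor sees the response and the criterion only,
        # never the ground truth (minimal exposure). A deterministic
        # comparison then does the actual checking.
        extracted = extract(response, c.description)
        return float(extracted.strip().lower() == c.ground_truth.strip().lower())
    # Path 2: the judge receives text only (no image is passed in).
    return float(bool(judge(response, c.description)))


def aggregate(criteria: Sequence[Criterion], scores: Sequence[float],
              bonus_weight: float = 0.5) -> float:
    essential = [s for c, s in zip(criteria, scores) if c.essential]
    additional = [s for c, s in zip(criteria, scores) if not c.essential]
    ess = sum(essential) / len(essential) if essential else 1.0
    add = sum(additional) / len(additional) if additional else 0.0
    # Hypothetical gating: additional criteria only contribute once
    # every essential criterion is satisfied.
    if ess < 1.0:
        return (1.0 - bonus_weight) * ess
    return (1.0 - bonus_weight) + bonus_weight * add


def rubric_reward(criteria, response, extract, judge) -> float:
    scores = [score_criterion(c, response, extract, judge) for c in criteria]
    return aggregate(criteria, scores)


# Stand-ins for the LLM extractor and LLM judge (assumptions for the demo).
def toy_extract(response: str, description: str) -> str:
    return response.rsplit("Answer:", 1)[-1]


def toy_judge(response: str, description: str) -> bool:
    return "apple" in response.lower()


criteria = [
    Criterion("The final answer states the number of apples", True, ground_truth="3"),
    Criterion("The explanation refers to the apples in the image", False),
]
reward = rubric_reward(criteria, "I count the apples on the table. Answer: 3",
                       toy_extract, toy_judge)
```

Under this gating rule, a response that fails any essential criterion always scores below `1 - bonus_weight`, so it can never outrank a response that satisfies all essential criteria, regardless of how many additional criteria it meets.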
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Rubric Rewards
Vision-Language Tasks
Multi-criteria Supervision
Partial Verifiability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust Rubric Rewards
Criterion-level Verification
Minimal Exposure Strategy
Hierarchical Aggregation
LLM-as-a-Judge
Ya-Qi Yu
Huawei Technologies
Computer Vision, Machine Learning
Hao Wang
Ph.D. from Peking University (2017), then joined Huawei
Pervasive Computing, Device-free Wi-Fi Human Sensing, Mobile Sensing, Smart Environments
Fangyu Hong
Huawei Technologies Co., Ltd.
Xiangyang Qu
Huawei Technologies Co., Ltd.
Gaojie Wu
Huawei Technologies Co., Ltd.
Qiaoyu Luo
Huawei Technologies Co., Ltd.
Nuo Xu
Huawei Technologies Co., Ltd.
Huixin Wang
Huawei Technologies Co., Ltd.
Wuheng Xu
Huawei Technologies Co., Ltd.
Yongxin Liao
Huawei Technologies Co., Ltd.
Zihao Chen
Huawei Technologies Co., Ltd.
Haonan Li
Huawei Technologies Co., Ltd.
Ziming Li
Huawei Technologies Co., Ltd.
Dezhi Peng
Huawei Technologies, South China University of Technology
Computer Vision
Minghui Liao
Huawei Technologies Co., Ltd.
Jihao Wu
Huawei Inc.
Computer Vision, Multi-Modality
Haoyu Ren
Huawei Technologies Co., Ltd.
Dandan Tu
Huawei Technologies Co., Ltd.