Reinforcement Learning with Robust Rubric Rewards

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of providing effective reinforcement learning rewards for vision-language tasks, which are often only partially verifiable. The authors propose a criterion-level verification framework that refines reward modeling from the task level down to individual rubric criteria. Each criterion is executed either by an LLM extractor paired with a deterministic verifier or, for non-verifiable criteria, by an LLM judge. A minimal exposure strategy hides ground truths from extractors and images from judges to prevent information leakage. A hierarchical reward aggregation mechanism prioritizes essential criteria over additional ones, and score saturation within rollout groups is mitigated. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, the method consistently outperforms RLVR and improves on the base model by 4.7 points, a gain that exceeds the gap between the official instruct and thinking models. Controlled audits indicate that deterministic verification and minimal exposure reduce exploitable false positives.
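The summary does not give formulas for hierarchical aggregation or saturation handling, so the following is only a minimal sketch of one possible reading. The function names `aggregate` and `unsaturated_mask`, the `bonus_weight` parameter, and the gating rule are all assumptions made for illustration, not the authors' method.

```python
from typing import List


def aggregate(essential: List[float], additional: List[float],
              bonus_weight: float = 0.2) -> float:
    """Combine per-criterion scores in [0, 1] into one reward in [0, 1].

    Assumed rule (not from the paper): the mean essential score gates the
    reward, and additional criteria only contribute a bounded bonus that is
    itself scaled by the essential score.
    """
    ess = sum(essential) / len(essential) if essential else 1.0
    add = sum(additional) / len(additional) if additional else 0.0
    return ess * ((1.0 - bonus_weight) + bonus_weight * add)


def unsaturated_mask(group_scores: List[List[float]],
                     threshold: float = 1.0) -> List[bool]:
    """Flag criteria that still separate the rollouts of one prompt.

    group_scores[i][j] is the score of rollout i on criterion j.
    Assumed rule (not from the paper): a criterion whose mean score across
    the group reaches `threshold` is treated as saturated and masked out.
    """
    n_rollouts = len(group_scores)
    n_criteria = len(group_scores[0])
    return [
        sum(row[j] for row in group_scores) / n_rollouts < threshold
        for j in range(n_criteria)
    ]
```

Under this assumed rule, a response that misses every essential criterion receives zero reward regardless of how many additional criteria it satisfies, which is one simple way to express "essential over additional".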

📝 Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are only partially verifiable and demand multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on how accurately they are executed during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model, a gain that exceeds the gap between the official instruct and thinking models. Controlled audits confirm that our deterministic verification and minimal exposure significantly reduce exploitable false positives.
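The two execution paths and the minimal exposure strategy can be sketched roughly as follows. This is an illustrative sketch only: the `Criterion` class, the `score_criterion` function, the prompt wording, and the exact-match verifier are hypothetical, and the LLM calls are modeled as plain text-in, text-out callables rather than any real API. The abstract does not state whether extractors see the image, so this sketch keeps both calls text-only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical LLM interface: a prompt string in, a text completion out.
LLMCall = Callable[[str], str]


@dataclass
class Criterion:
    text: str                           # what the rubric item checks
    ground_truth: Optional[str] = None  # present only for verifiable criteria


def normalize(s: str) -> str:
    """Case- and whitespace-insensitive form used for exact matching."""
    return " ".join(s.strip().lower().split())


def score_criterion(criterion: Criterion, response: str,
                    extractor: LLMCall, judge: LLMCall) -> float:
    """Return 1.0 if the response satisfies the criterion, else 0.0."""
    if criterion.ground_truth is not None:
        # Path 1: the extractor sees the criterion and the response, but
        # never the ground truth, so it cannot echo the expected answer.
        prompt = (f"Criterion: {criterion.text}\n"
                  f"Response: {response}\n"
                  "Extract the value the response gives for this criterion.")
        extracted = extractor(prompt)
        # The deterministic verifier, not the LLM, makes the comparison.
        matched = normalize(extracted) == normalize(criterion.ground_truth)
        return 1.0 if matched else 0.0
    # Path 2: the judge receives text only; the image is never passed in.
    prompt = (f"Criterion: {criterion.text}\n"
              f"Response: {response}\n"
              "Does the response satisfy the criterion? Answer yes or no.")
    return 1.0 if normalize(judge(prompt)).startswith("yes") else 0.0
```

Keeping the comparison outside the extractor means a lenient or manipulated LLM cannot by itself produce a passing score on a verifiable criterion, which is the kind of false positive the audits in the abstract target.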

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Rubric Rewards

Vision-Language Tasks

Multi-criteria Supervision

Partial Verifiability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Robust Rubric Rewards

Criterion-level Verification

Minimal Exposure Strategy