🤖 AI Summary
This work addresses the challenge of providing effective reinforcement learning rewards for vision-language tasks, which are often only partially verifiable. The authors propose a criterion-level verification framework that refines reward modeling from the task level down to individual rubric criteria. Each criterion is routed either to an LLM extractor paired with a deterministic verifier or, when it cannot be checked deterministically, to an LLM judge. A minimal exposure strategy hides ground truths from extractors and images from judges to keep scoring faithful. Hierarchical reward aggregation prioritizes essential criteria over additional ones, and score saturation within rollout groups is mitigated. Evaluated on the Qwen3-VL-30B-A3B model, the method consistently outperforms RLVR and achieves an average improvement of 4.7 points over the base model across 15 benchmarks, a gain that exceeds the gap between the official instruct and thinking models. Controlled audits indicate that deterministic verification and minimal exposure reduce exploitable false positives.
📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are only partially verifiable and demand multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on how accurately they are executed during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier for verifiable criteria, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model, which exceeds the gap between the official instruct and thinking models. Controlled audits confirm that our deterministic verification and minimal exposure significantly reduce exploitable false positives.
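The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Criterion` fields, the exact-match verifier, the gating rule with `bonus_weight`, and the rule that drops criteria satisfied by every rollout are all assumptions made for illustration, since the abstract does not specify these details. The `extract` and `judge` callables stand in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Criterion:
    description: str
    essential: bool                 # essential vs. additional criterion
    expected: Optional[str] = None  # ground truth; set only for verifiable criteria


def score_criterion(c: Criterion, response: str,
                    extract: Callable[[str, str], str],
                    judge: Callable[[str, str], float]) -> float:
    """Route one criterion to the verifier path or the judge path."""
    if c.expected is not None:
        # Minimal exposure: the extractor sees the response and the criterion
        # text, never the ground truth. The comparison itself is deterministic.
        extracted = extract(response, c.description)
        return float(extracted.strip().lower() == c.expected.strip().lower())
    # Minimal exposure: the judge receives text only, no image.
    return float(judge(response, c.description))


def aggregate(criteria: List[Criterion], scores: List[float],
              bonus_weight: float = 0.3) -> float:
    """Hierarchical aggregation (assumed rule): additional criteria only
    contribute once every essential criterion is satisfied."""
    essential = [s for c, s in zip(criteria, scores) if c.essential]
    additional = [s for c, s in zip(criteria, scores) if not c.essential]
    ess = sum(essential) / len(essential) if essential else 1.0
    add = sum(additional) / len(additional) if additional else 0.0
    gate = 1.0 if ess == 1.0 else 0.0
    return (1.0 - bonus_weight) * ess + bonus_weight * gate * add


def group_rewards(criteria: List[Criterion],
                  score_matrix: List[List[float]]) -> List[float]:
    """Saturation handling (assumed rule): drop criteria that every rollout
    in the group satisfies, because they cannot separate the rollouts.
    score_matrix[i][j] is the score of rollout i on criterion j."""
    keep = [j for j in range(len(criteria))
            if not all(row[j] == 1.0 for row in score_matrix)]
    kept = [criteria[j] for j in keep]
    return [aggregate(kept, [row[j] for j in keep]) for row in score_matrix]


# Toy stand-ins for the LLM extractor and judge (hypothetical).
def toy_extract(response: str, description: str) -> str:
    return response.split()[-1]


def toy_judge(response: str, description: str) -> float:
    return 1.0 if "because" in response else 0.0


criteria = [
    Criterion("states the final answer", essential=True, expected="42"),
    Criterion("justifies the answer", essential=False),
]
responses = ["because 6*7 is 42", "the answer is 42", "it is 41"]
matrix = [[score_criterion(c, r, toy_extract, toy_judge) for c in criteria]
          for r in responses]
rewards = group_rewards(criteria, matrix)
```

In this toy group, the rollout that is both correct and justified receives the highest reward, the correct but unjustified rollout sits in the middle, and the incorrect rollout receives no credit. Gating the additional criteria on the essential ones is one simple way to realize "prioritize essential criteria"; a weighted sum without gating would be an equally valid reading of the abstract.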