AI Summary
This work addresses an inefficiency in conventional rubric-based reinforcement learning, which aggregates human-defined criteria with static weights and thereby conflates each criterion's prescribed importance with its actual usefulness as a training signal at a given stage. To resolve this, the authors propose POW3R, a framework built around a policy-aware dynamic criterion weighting mechanism. POW3R uses rollout-level contrast to amplify reward signals from criteria that currently discriminate between the policy's outputs, decoupling the teaching signal from the final evaluation objective while preserving the latter. Built on GRPO with multi-criterion rubric rewards, POW3R outperforms vanilla GRPO with rubric rewards on text-only and multimodal tasks: it wins 24 of 30 base-policy/metric comparisons across two datasets and three base policies, improves both mean rubric reward and strict completion rate, and reaches the same performance plateau in 2.5 to 4× fewer training steps.
Abstract
Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
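The abstract does not give POW3R's weighting rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that "rollout-level contrast" can be approximated by the per-criterion standard deviation of rubric scores across one prompt's rollouts, and that the training weight is the human weight scaled by that spread. The function names (`static_rewards`, `contrast_weights`, `group_advantages`, `strict_completion`) and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def static_rewards(scores, human_w):
    """Static aggregation: human-weighted mean of criterion scores per rollout."""
    total = sum(human_w)
    return [sum(w * s for w, s in zip(human_w, row)) / total for row in scores]


def contrast_weights(scores, human_w, eps=1e-8):
    """ASSUMED rule (not from the paper): human weight times the spread of
    each criterion's score across the rollouts of one prompt, normalised."""
    n_criteria = len(human_w)
    spread = [pstdev([row[k] for row in scores]) for k in range(n_criteria)]
    raw = [w * s for w, s in zip(human_w, spread)]
    total = sum(raw)
    if total < eps:
        # No criterion separates the rollouts: fall back to the human weights.
        raw, total = list(human_w), sum(human_w)
    return [r / total for r in raw]


def weighted_rewards(scores, weights):
    """Scalar training reward per rollout under already-normalised weights."""
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalised advantages within one prompt's rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def strict_completion(responses, required, threshold=1.0):
    """Fraction of prompts whose response meets every required criterion."""
    hits = [
        all(s >= threshold for s, req in zip(row, mask) if req)
        for row, mask in zip(responses, required)
    ]
    return sum(hits) / len(hits)


# Toy group: 4 rollouts x 3 criteria, binary rubric scores.
# Criterion 0 is saturated (always met) but carries the largest human weight.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
human_w = [5.0, 3.0, 1.0]

eval_rewards = static_rewards(scores, human_w)      # evaluation target, unchanged
train_w = contrast_weights(scores, human_w)         # training-time weights
train_adv = group_advantages(weighted_rewards(scores, train_w))
```

In this toy group the saturated criterion receives zero training weight, while `static_rewards` still counts it in full, which mirrors the abstract's distinction between what should matter in the final answer and what can teach the current policy.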