Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

📅 2026-03-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional scalarization-based reward aggregation struggles to model interdependencies among multidimensional scores and is highly sensitive to manually specified weights. This work proposes ARL-RR, an alternating reinforcement learning framework that avoids fixed scalarization by optimizing one semantic reward meta-class at a time. A task-performance-driven scheduling mechanism dynamically chooses which meta-class to optimize next, so that training captures the underlying multidimensional reward structure. Evaluated on the HealthBench dataset, ARL-RR consistently outperforms existing scalarization approaches across model scales from 1.7B to 14B parameters, improving both policy performance and training efficiency.

📝 Abstract

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing RLRR approaches are limited to linearly compressing vector rewards into a scalar reward with fixed weights, which is sensitive to manual score design and fails to capture correlations among reward dimensions. To overcome these limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that dynamically selects the next meta-class based on task performance, enabling the policy to emphasize critical objectives and thereby improve model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
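
To make the contrast in the abstract concrete, the sketch below compares a fixed-weight scalarized reward with a reward drawn from a single active meta-class. It is only an illustration under stated assumptions, not the paper's implementation. The meta-class names, the weights, and the "pick the weakest meta-class" rule are all hypothetical; the abstract says only that the next meta-class is selected by a lightweight, search-based procedure driven by task performance.

```python
from typing import Dict

# Maps a rubric meta-class name to a score in [0, 1].
# The names used below are illustrative assumptions, not taken from the paper.
RubricScores = Dict[str, float]


def scalarized_reward(scores: RubricScores, weights: Dict[str, float]) -> float:
    """Baseline: compress the reward vector into one scalar with fixed weights."""
    return sum(weights[name] * scores[name] for name in scores)


def alternating_reward(scores: RubricScores, active: str) -> float:
    """Alternating phase: the policy is rewarded on one meta-class only."""
    return scores[active]


def pick_next_meta_class(validation_scores: RubricScores) -> str:
    """Hypothetical scheduling rule: train next on the weakest meta-class.

    This stands in for the paper's search-based procedure, whose details
    are not given in the abstract.
    """
    return min(validation_scores, key=validation_scores.get)


# Hypothetical per-meta-class scores for one response.
scores = {"accuracy": 0.8, "completeness": 0.4, "communication": 0.6}
weights = {"accuracy": 0.5, "completeness": 0.25, "communication": 0.25}

baseline = scalarized_reward(scores, weights)
active = pick_next_meta_class(scores)
phase_reward = alternating_reward(scores, active)
```

In this toy setup the scalarized reward blends all three scores, so a low score in one meta-class is partly masked by the others. The alternating reward exposes the active meta-class's score directly, and the schedule decides which meta-class receives that focus in the next phase.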

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Rubric Rewards

scalarization

multi-dimensional rewards

reward aggregation

contextual rubric

Innovation

Methods, ideas, or system contributions that make the work stand out.

Alternating Reinforcement Learning

Contextual Rubric Rewards

Reward Aggregation