Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of reward polarization in reinforcement learning with large language models under multidimensional scoring criteria, which often leads to severe performance degradation in certain dimensions. To mitigate this issue, the authors propose Focal Reward, a novel online reward reallocation mechanism that dynamically assesses the saturation level of each dimension via inverse reward projection and adaptively reweights rewards to prioritize dimensions with remaining improvement potential. Evaluated across three model scales and six benchmarks, the method consistently outperforms the strongest static aggregation baselines, achieving superior performance in all 18 comparisons and effectively alleviating multidimensional reward imbalance.

📝 Abstract

Open-ended generation with LLMs usually requires multi-dimensional rubrics to assess quality adequately and to guide improvement through reinforcement learning. However, this training paradigm faces a critical problem: reward polarization, in which rewards become imbalanced across rubric dimensions. As a result, even LLMs that achieve relatively high overall rewards after training may still exhibit severe deficiencies in certain dimensions, which directly degrades the user experience. To address this problem, we propose Focal Reward, a novel objective that automatically balances reinforcement learning training under rubric-based rewards. Specifically, we first use an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis for calibrating the reward direction. The final objective then applies an automatic reweighting coefficient to each criterion to achieve fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that Focal Reward outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.
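To make the idea of saturation-aware reweighting concrete, here is a minimal Python sketch. It is not the authors' formulation: the abstract does not define the inverse reward projection or the exact reweighting coefficient. As stated assumptions, the sketch stands in a running mean of each criterion's score (in [0, 1]) for the saturation estimate, and borrows a focal-loss-style weight `(1 - saturation) ** gamma`. The names `focal_weights`, `reweighted_reward`, `update_saturation`, `gamma`, and `momentum` are all hypothetical.

```python
from typing import List


def focal_weights(saturation: List[float], gamma: float = 2.0) -> List[float]:
    """Assumed focal-style weights: less-saturated criteria get more weight.

    `saturation` holds one value in [0, 1] per rubric criterion.
    """
    raw = [(1.0 - s) ** gamma for s in saturation]
    total = sum(raw)
    if total == 0.0:
        # Every criterion is fully saturated: fall back to uniform weights.
        return [1.0 / len(saturation)] * len(saturation)
    return [r / total for r in raw]


def reweighted_reward(scores: List[float], saturation: List[float],
                      gamma: float = 2.0) -> float:
    """Aggregate per-criterion scores into one scalar reward."""
    weights = focal_weights(saturation, gamma)
    return sum(w * x for w, x in zip(weights, scores))


def update_saturation(saturation: List[float], batch_mean_scores: List[float],
                      momentum: float = 0.9) -> List[float]:
    """Assumed stand-in for the saturation estimate: an exponential
    moving average of each criterion's mean score over recent rollouts."""
    return [momentum * s + (1.0 - momentum) * m
            for s, m in zip(saturation, batch_mean_scores)]


if __name__ == "__main__":
    # Criterion 0 is nearly saturated; criterion 1 still has headroom.
    saturation = [0.9, 0.5]
    print(focal_weights(saturation))
    print(reweighted_reward([1.0, 0.4], saturation))
```

Under these assumptions, a static aggregation baseline corresponds to fixing the weights once, whereas the sketch recomputes them as the saturation estimates change during training, which is the online behavior the abstract describes.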

Problem

Research questions and friction points this paper is trying to address.

reward polarization

rubric-based rewards

reinforcement learning

balanced training

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Focal Reward

rubric-based rewards

reward balancing