Reinforcement Learning with Rubric Anchors

📅 2025-08-18

🤖 AI Summary

Reinforcement learning (RL) with verifiable rewards is hard to apply to open-ended, subjective generation tasks because their outputs cannot be checked automatically. This paper extends reward-based RL to such tasks by using rubrics as anchors for scoring. The authors build what they describe as the largest rubric reward system to date (over 10,000 rubrics written by humans, by LLMs, or through human-LLM collaboration), together with a rubric-driven reward framework for automated, fine-grained evaluation and policy optimization of open-ended outputs. Trained on only 5K+ samples, their Qwen-30B-A3B model gains 5.2% on open-ended benchmarks, especially humanities-oriented ones, and outperforms the 671B-parameter DeepSeek-V3 by 2.4% while preserving general and reasoning abilities. The method also gives finer control over style, reducing the "AI-like" tone in favor of more human-like, expressive responses.

📝 Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we address these challenges with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
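
To make the contrast between a verifiable reward and a rubric-based reward concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Criterion` class, the weighted-average aggregation, and the `toy_judge` function are all assumptions made for illustration. In a real system the judge would presumably be an LLM grader that scores a response against each rubric criterion; here a hard-coded stand-in keeps the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float


def verifiable_reward(answer: str, reference: str) -> float:
    """RLVR-style reward: 1.0 on an exact match with a known answer, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def rubric_reward(
    response: str,
    rubric: Sequence[Criterion],
    judge: Callable[[str, str], float],
) -> float:
    """Weighted average of per-criterion judge scores, each clipped to [0, 1].

    The aggregation rule is an assumption; the abstract does not specify one.
    """
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric must have positive total weight")
    weighted = 0.0
    for criterion in rubric:
        score = judge(response, criterion.description)
        weighted += criterion.weight * min(1.0, max(0.0, score))
    return weighted / total_weight


def toy_judge(response: str, criterion: str) -> float:
    """Hard-coded stand-in for an LLM grader (illustration only)."""
    if criterion == "avoids boilerplate AI phrasing":
        return 0.0 if "as an ai" in response.lower() else 1.0
    if criterion == "stays under 20 words":
        return 1.0 if len(response.split()) <= 20 else 0.0
    return 0.5  # unknown criterion: neutral score


RUBRIC = [
    Criterion("avoids boilerplate AI phrasing", weight=2.0),
    Criterion("stays under 20 words", weight=1.0),
]

if __name__ == "__main__":
    for text in ("Rain drums softly on the tin roof.",
                 "As an AI, I think rain is nice."):
        print(f"{rubric_reward(text, RUBRIC, toy_judge):.3f}  {text}")
```

The scalar returned by `rubric_reward` could then stand in for the verifiable signal in a policy-optimization loop. How rubrics are actually weighted, combined, and guarded against reward hacking is left to the paper itself.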

Problem

Research questions and friction points this paper is trying to address.

Extends RLVR to open-ended tasks using rubric-based rewards

Constructs what the authors describe as the largest rubric reward system to date, with 10,000+ rubrics from humans, LLMs, or human-LLM collaboration

Enhances model performance and stylistic control in subjective outputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-based rewards for open-ended tasks

Largest rubric reward system with 10K+ rubrics

Qwen-30B-A3B model with fine-grained stylistic control

🔎 Similar Papers

A Review of Reward Functions for Reinforcement Learning in the context of Autonomous Driving