Reinforcement Learning with Rubric Anchors

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Constrained by the absence of automatically verifiable rewards in open-ended tasks, reinforcement learning (RL) struggles to generalize to subjective generation domains. This paper introduces Rubric Anchors, a paradigm that extends reward-based RL to such tasks. Its core contribution is the largest structured rubric corpus to date (over 10,000 entries), co-authored by human experts and LLMs, together with a rubric-driven reward-modeling framework that enables automated, fine-grained evaluation and policy optimization of open-domain outputs. Experiments on Qwen-30B-A3B show a +5.2% improvement on open-ended benchmarks using only 5K+ training samples, surpassing the 671B-parameter DeepSeek-V3 by +2.4%. The method also markedly improves performance on humanities-oriented tasks and gives finer control over expressive, human-like stylistic attributes.

📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals, such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
Problem

Research questions and friction points this paper is trying to address.

RLVR rewards depend on automatically checkable outcomes (unit tests, exact answers), confining it to domains like code and math
Open-ended, subjective generation tasks lack verifiable reward signals for RL
RL-tuned models tend toward a formulaic "AI-like" tone whose stylistic attributes are hard to control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-based rewards that extend RLVR to open-ended, subjective tasks
Largest rubric reward system to date, with 10K+ human-, LLM-, and hybrid-authored rubrics
Open-sourced Qwen-30B-A3B model with fine-grained stylistic control
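The paper describes rubrics as structured, model-interpretable criteria whose scores are aggregated into a scalar reward, but the index page does not reproduce the implementation. Below is a minimal sketch of how rubric-anchored reward aggregation might look, assuming per-criterion grader scores in [0, 1] and per-criterion weights; all names (`RubricItem`, `rubric_reward`) and the example rubric are hypothetical, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    # One rubric criterion: a model-interpretable description plus a weight.
    criterion: str
    weight: float

def rubric_reward(scores, rubric):
    """Aggregate per-criterion grader scores (each in [0, 1]) into a scalar
    reward via a weight-normalized sum; scores[i] grades rubric[i]."""
    total_weight = sum(item.weight for item in rubric)
    if total_weight == 0:
        return 0.0
    return sum(s * item.weight for s, item in zip(scores, rubric)) / total_weight

# Hypothetical rubric for an open-ended humanities response.
rubric = [
    RubricItem("Addresses the question directly", 2.0),
    RubricItem("Supports claims with concrete evidence", 1.0),
    RubricItem("Avoids formulaic 'AI-like' phrasing", 1.0),
]
reward = rubric_reward([1.0, 0.5, 1.0], rubric)  # (2 + 0.5 + 1) / 4 = 0.875
```

In an RL loop, a grader model would produce the per-criterion scores for each policy rollout, and the scalar reward would feed a standard policy-optimization step; the weight normalization keeps rewards comparable across rubrics of different sizes.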
👥 Authors
Zenan Huang — Ant Research
Yihong Zhuang — Inclusion AI, Ant Group, Zhejiang University
Guoshan Lu — Zhejiang University
Zeyu Qin — Hong Kong University of Science and Technology
Haokai Xu — Inclusion AI, Ant Group, Zhejiang University
Tianyu Zhao — Inclusion AI, Ant Group, Zhejiang University
Ru Peng — Zhejiang University & Qwen Team, Alibaba Group
Jiaqi Hu — Rice University; Genentech
Zhanming Shen — Inclusion AI, Ant Group, Zhejiang University
Xiaomeng Hu — Inclusion AI, Ant Group, Zhejiang University
Xijun Gu — Inclusion AI, Ant Group, Zhejiang University
Peiyi Tu — Inclusion AI, Ant Group, Zhejiang University
Jiaxin Liu — Inclusion AI, Ant Group, Zhejiang University
Wenyu Chen — Massachusetts Institute of Technology
Yuzhuo Fu — Inclusion AI, Ant Group, Zhejiang University
Zhiting Fan — Inclusion AI, Ant Group, Zhejiang University
Yanmei Gu — Inclusion AI, Ant Group, Zhejiang University
Yuanyuan Wang — Inclusion AI, Ant Group, Zhejiang University
Zhengkai Yang — Inclusion AI, Ant Group, Zhejiang University
Jianguo Li — Director, Ant Group
Junbo Zhao — Inclusion AI, Ant Group, Zhejiang University