Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

📅 2026-02-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of traditional scalar reward models, which compress multidimensional human preferences into a single score, often leading to reward hacking and alignment fragility. The authors propose an LLM-as-a-Judge framework grounded in explicit scoring rules, featuring Pairwise Adaptive Meta-Rubrics (PAMR), which dynamically instantiate adaptive evaluation criteria, and Pointwise Verifiable Rubrics (PVRs), which provide verifiable constraints and reward signals. By reframing reward modeling as an auditable, explicit reasoning process rather than an opaque implicit function, the approach incorporates a two-level meta-rubric refinement pipeline (combining automated evolution and human feedback) and criterion-level pairwise comparisons that avoid the information loss of scalar weighting. Experiments demonstrate that the framework significantly improves reward discriminability and alignment robustness on open-domain tasks, effectively suppresses degenerate behaviors, and delivers reliable reward signals for verifiable sub-tasks.
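The criterion-wise judging described above can be made concrete with a short sketch. The code below shows one plausible shape for PAMR-style preference extraction, assuming two helper callables, `generate_rubric` and `judge_criterion`, and a simple majority vote as the external aggregator; none of these names or design choices come from the paper.

```python
# Minimal sketch of PAMR-style criterion-wise pairwise judging.
# generate_rubric and judge_criterion are assumed helpers, not the paper's API.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str

def pamr_preference(prompt: str, resp_a: str, resp_b: str,
                    generate_rubric, judge_criterion) -> str:
    """Return 'A', 'B', or 'tie' via criterion-level pairwise comparison.

    generate_rubric: instantiates adaptive criteria by conditioning on the
        semantic differences between the two candidate responses.
    judge_criterion: compares the pair under one criterion and returns
        'A', 'B', or 'tie'.
    """
    criteria = generate_rubric(prompt, resp_a, resp_b)
    # Each criterion emits a pairwise verdict; no scalar scores are produced.
    verdicts = [judge_criterion(prompt, resp_a, resp_b, c) for c in criteria]
    wins_a = sum(v == "A" for v in verdicts)
    wins_b = sum(v == "B" for v in verdicts)
    # External aggregation over criterion-level preferences (here a simple
    # majority vote), avoiding pointwise weighted scalarization.
    if wins_a > wins_b:
        return "A"
    if wins_b > wins_a:
        return "B"
    return "tie"
```

The point of the structure is that no pointwise scalar score is ever computed: each criterion yields only a pairwise verdict, and only those verdicts are aggregated.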

📝 Abstract
Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
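To make the guardrail role of the Pointwise Verifiable Rubrics concrete, here is a minimal sketch of pointwise, programmatically verifiable checks that gate a reward signal. The specific checks, the zero-gating rule, and all function names are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of lightweight Pointwise Verifiable Rubrics (PVRs):
# programmatic checks acting both as hard-constraint guardrails and as a
# verifiable reward component when ground truth is available.
import re

def pvr_checks(response: str, reference: str | None = None) -> dict[str, bool]:
    """Run pointwise, programmatically verifiable checks on one response."""
    checks = {
        # Guardrail against a degenerate behavior: near-empty output.
        "non_empty": len(response.strip()) >= 10,
        # Guardrail against verbatim repetition loops.
        "no_repetition": not re.search(r"\b(\w+ \w+)\b(?: \1){3,}", response),
    }
    if reference is not None:
        # Verifiable reward for an objective sub-task with ground truth.
        checks["matches_reference"] = reference.strip() in response
    return checks

def pvr_reward(response: str, reference: str | None = None) -> float:
    checks = pvr_checks(response, reference)
    # Hard constraints gate the reward to zero when violated; otherwise the
    # fraction of passed checks serves as the verifiable reward component.
    if not (checks["non_empty"] and checks["no_repetition"]):
        return 0.0
    return sum(checks.values()) / len(checks)
```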
Problem

Research questions and friction points this paper is trying to address.

reward hacking
human preference alignment
scalar reward bottleneck
non-verifiable tasks
robust alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open Rubric System
Pairwise Adaptive Meta-Rubrics
Verifiable Rubrics
LLM-as-a-Judge
Reinforcement Learning Alignment
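The abstract's final step, instantiating OpenRS as reward supervision in pairwise RL training, can be sketched by composing the two components above: verifiable guardrails first screen each response, and rubric-based pairwise judging adjudicates the rest. The decision order and helper signatures below are assumptions for illustration.

```python
# Hypothetical composition of the two sketches above: PVR guardrails gate the
# pair, and PAMR judges responses that pass the checks. The decision order is
# an assumption, not the paper's specification.
from typing import Callable

def openrs_pair_label(prompt: str, resp_a: str, resp_b: str,
                      pamr: Callable[[str, str, str], str],
                      pvr: Callable[[str], float]) -> str:
    """Produce a pairwise preference label for RL training."""
    ok_a, ok_b = pvr(resp_a) > 0.0, pvr(resp_b) > 0.0
    if ok_a and not ok_b:  # a guardrail violation alone decides the pair
        return "A"
    if ok_b and not ok_a:
        return "B"
    # Both pass (or both fail): fall back to rubric-based pairwise judging.
    return pamr(prompt, resp_a, resp_b)
```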
👥 Authors
Ruipeng Jia (Qwen Large Model Application Team, Alibaba)
Yunyi Yang (Sun Yat-Sen University)
Yuxin Wu (Beijing University of Posts and Telecommunications)
Yongbo Gai (Qwen Large Model Application Team, Alibaba)
Siyuan Tao (Institute of Computing Technology, Chinese Academy of Sciences)
Mengyu Zhou (Microsoft Research)
Jianhe Lin (Qwen Large Model Application Team, Alibaba)
Xiaoxi Jiang (Qwen Large Model Application Team, Alibaba)
Guanjun Jiang (Qwen Large Model Application Team, Alibaba)