RubricBench: Aligning Model-Generated Rubrics with Human Standards

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified benchmark for reliably evaluating rubric-based model assessment, in particular one with the discriminative sample difficulty and authentic rubric annotations required for rigorous analysis. The authors propose RubricBench, a high-quality benchmark of 1,147 pairwise comparisons, built with a multi-dimensional difficulty sampling mechanism that targets challenging instances marked by nuanced input complexity and misleading surface-level biases. RubricBench introduces expert-annotated, atomic rubrics derived strictly from the original instructions, enabling, for the first time, systematic evaluation of how well model-generated rubrics align with human-defined ones. Experimental results demonstrate that even state-of-the-art models significantly underperform humans at automatically generating valid rubrics and fail to match the evaluation performance achieved under human guidance.
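To make the evaluation paradigm concrete, here is a minimal sketch of rubric-based pairwise judging: each atomic criterion is checked independently, responses are scored by the weighted fraction of criteria they satisfy, and the higher-scoring response wins. All names here (RubricItem, Judge, pairwise_winner) are hypothetical illustrations, not the paper's implementation; in practice the judge would be an LLM call rather than a toy heuristic.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    """One atomic, instruction-derived criterion, e.g. 'cites at least two sources'."""
    criterion: str
    weight: float = 1.0

# A judge maps (response, criterion) -> satisfied? In practice this would be
# an LLM call; any callable with this signature plugs in.
Judge = Callable[[str, str], bool]

def rubric_score(response: str, rubric: list[RubricItem], judge: Judge) -> float:
    """Weighted fraction of rubric items the response satisfies."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric
                 if judge(response, item.criterion))
    return earned / total if total else 0.0

def pairwise_winner(resp_a: str, resp_b: str,
                    rubric: list[RubricItem], judge: Judge) -> str:
    """Prefer the response with the higher rubric score; 'tie' on equality."""
    sa = rubric_score(resp_a, rubric, judge)
    sb = rubric_score(resp_b, rubric, judge)
    return "A" if sa > sb else ("B" if sb > sa else "tie")

# Example with a toy keyword judge (illustration only):
# rubric = [RubricItem("mentions a limitation"), RubricItem("gives an example", 2.0)]
# toy_judge = lambda resp, crit: crit.split()[-1] in resp.lower()
# print(pairwise_winner(resp_a, resp_b, rubric, toy_judge))
```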

📝 Abstract
As Large Language Model (LLM) alignment evolves from simple completions to highly sophisticated generation, reward models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark for assessing this evaluation paradigm: existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark of 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from the instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria and lag considerably behind human-guided performance.
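As one illustration of what filtering for "misleading surface bias" might look like, the sketch below keeps only pairs where a shallow cue (response length) points away from the expert-annotated preference, so a superficial judge would be misled. This is an assumed, simplified single-cue filter for exposition; the paper's multi-dimensional filtration pipeline is not specified here, and ComparisonPair and the length heuristic are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ComparisonPair:
    prompt: str
    response_a: str
    response_b: str
    human_preference: str  # 'A' or 'B', the expert-annotated gold label

def length_bias_pick(pair: ComparisonPair) -> str:
    """What a length-biased judge would choose: the longer response
    (ties break toward 'A')."""
    return "A" if len(pair.response_a) >= len(pair.response_b) else "B"

def is_hard(pair: ComparisonPair) -> bool:
    """Flag pairs where the surface cue contradicts the gold preference."""
    return length_bias_pick(pair) != pair.human_preference

def filter_hard(pairs: list[ComparisonPair]) -> list[ComparisonPair]:
    """Retain only the bias-misleading pairs as candidate hard samples."""
    return [p for p in pairs if is_hard(p)]
```

A real pipeline would presumably combine several such cues (length, formatting, verbosity, input complexity) rather than length alone; this sketch shows only the filtering pattern.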
Problem

Research questions and friction points this paper is trying to address.

rubric-based evaluation
benchmark
LLM alignment
reward models
human-annotated rubrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

RubricBench
rubric-guided evaluation
reward modeling
LLM alignment
expert-annotated rubrics