JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

📅 2026-05-24
🤖 AI Summary
In highly specialized domains that lack ground-truth answers, it remains unclear how best to evaluate the quality of generated content, and whether scoring rubrics or pairwise preferences provide the more suitable supervision signal. This work introduces JudgmentBench, a benchmark of 30 real-world legal tasks in which the same experienced practicing attorneys provide both rubric-based scores and pairwise preference annotations for LLM-generated outputs at three constructed quality levels. In an initial empirical comparison, pairwise preference judgments recover the intended quality ordering substantially better than rubric-based scoring (mean Spearman's rank correlation of 0.908 versus 0.150) while requiring less than half the annotation time. The same patterns hold for human annotators and LLM autograders. The study contributes the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items, along with a basis for further research on evaluation in such domains.
📝 Abstract
Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys with substantial experience, including attorneys at major U.S. law firms. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
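To make the reported metric concrete, the sketch below shows one way a Spearman rank correlation could be computed between the intended quality ordering and the ordering implied by each kind of annotation, for a single task. It is illustrative only: the abstract does not describe the paper's exact aggregation procedure, so the win-count aggregation of pairwise preferences, the helper names (`average_ranks`, `spearman`, `win_counts`), and the toy data are all assumptions and not taken from JudgmentBench.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side has no variation
    return cov / sqrt(vx * vy)


def win_counts(outputs, preferences):
    """Assumed aggregation: score each output by how many comparisons it won."""
    wins = {o: 0 for o in outputs}
    for winner, _loser in preferences:
        wins[winner] += 1
    return [wins[o] for o in outputs]


# Hypothetical toy data for one task (not from the dataset).
outputs = ["low", "mid", "high"]   # three constructed quality levels
intended = [1, 2, 3]               # intended quality ordering
rubric_scores = [4, 5, 4]          # made-up rubric totals per output
preferences = [("high", "mid"), ("high", "low"), ("mid", "low")]  # (winner, loser)

rho_rubric = spearman(intended, rubric_scores)
rho_pairwise = spearman(intended, win_counts(outputs, preferences))
```

In this toy case the pairwise win counts reproduce the intended ordering exactly, while the tied rubric scores carry no ordering information. A mean correlation like the one reported would then presumably be an average of such per-task or per-annotator values, though the unit of averaging is not stated in the abstract.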
Problem

Research questions and friction points this paper is trying to address.

rubric-based scoring
comparative judgment
quality assessment
expert judgment
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

JudgmentBench
rubric-based scoring
comparative judgment
expert annotation
LLM evaluation