JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

📅 2026-05-24

🤖 AI Summary

In highly specialized domains that lack ground-truth answers, it is unclear how best to evaluate the quality of generated content, and in particular whether scoring rubrics or pairwise preferences are the more suitable supervision signal. This work introduces JudgmentBench, a benchmark of 30 real-world legal tasks in which the same cohort of experienced lawyers provides both rubric-based scores and pairwise preference annotations for LLM-generated outputs at three constructed quality levels. In the reported comparison, pairwise preference judgments recover the intended quality ordering far better than rubric scores (mean Spearman rank correlation of 0.908 versus 0.150) and require less than half the annotation time. The same pattern holds for human annotators and for LLM autograders. The authors describe the annotations as the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items.

📝 Abstract

Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys with substantial experience, including attorneys at major U.S. law firms. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
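
To make the headline metric concrete, here is a minimal sketch of how Spearman's rank correlation against an intended quality ordering could be computed for each supervision signal. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not say how pairwise preferences are aggregated into a ranking, so the win-count aggregation below is an assumption, and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks in the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def win_counts(items, preferences):
    """Assumed aggregation: score each item by its number of pairwise wins."""
    wins = defaultdict(int)
    for winner, _loser in preferences:
        wins[winner] += 1
    return [wins[item] for item in items]


# Hypothetical toy data for a single task (not taken from JudgmentBench).
items = ["low", "mid", "high"]
intended = [1, 2, 3]                      # constructed quality levels
rubric_scores = [4.0, 3.5, 4.0]           # made-up rubric totals
preferences = [("high", "mid"), ("high", "low"), ("mid", "low")]

rho_rubric = spearman(intended, rubric_scores)
rho_pref = spearman(intended, win_counts(items, preferences))
print(f"rubric rho = {rho_rubric:.3f}, preference rho = {rho_pref:.3f}")
```

In this toy case the rubric totals fail to separate the low and high outputs, so their correlation with the intended ordering is zero, while the win counts reproduce the ordering exactly. A per-task correlation of this kind, averaged over tasks, is one plausible reading of the "mean Spearman's rank correlation" the abstract reports.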

Problem

Research questions and friction points this paper is trying to address.

rubric-based scoring

comparative judgment

quality assessment

expert judgment

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

JudgmentBench

rubric-based scoring

comparative judgment

expert annotation

LLM evaluation

🔎 Similar Papers

MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

2024-10-03, arXiv.org, Citations: 5

Review-based Recommender Systems: A Survey of Approaches, Challenges and Future Perspectives

2024-05-09, arXiv.org, Citations: 4

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

2024-03-25, arXiv.org, Citations: 32
