🤖 AI Summary
This work addresses the limited reliability of current rubric-based instruction-following evaluations, which lack fine-grained meta-evaluation of judge accuracy at the rubric level. To bridge this gap, we propose RubricEval, the first rubric-level meta-evaluation benchmark, comprising a high-quality dataset of 3,486 human-annotated instances. We establish a taxonomy of scoring rubrics and partition the data into Easy and Hard subsets to better differentiate model judging capabilities. We systematically assess large language models under both rubric-level and checklist-level paradigms, with and without explicit reasoning. Experimental results reveal that even GPT-4o, a widely adopted judge, achieves only 55.97% accuracy on the Hard subset. Moreover, rubric-level evaluation consistently outperforms checklist-level evaluation, explicit reasoning improves judgment accuracy, and the two together reduce inter-judge variance.
📝 Abstract
Rubric-based evaluation has become a prevailing paradigm for assessing instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval, which offers: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted judge in instruction-following benchmarks, achieves only 55.97% accuracy on the Hard subset. Regarding evaluation paradigms, rubric-level evaluation outperforms checklist-level evaluation, explicit reasoning improves accuracy, and combining the two reduces inter-judge variance. Using our rubric taxonomy, we further identify common failure modes and offer actionable insights for reliable instruction-following evaluation.
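As a rough illustration of the rubric-level setting (not the paper's actual protocol), one can think of a judge model deciding, for each individual rubric attached to an instance, whether the response satisfies it, and then scoring the judge against human annotations. The sketch below assumes hypothetical field names and a placeholder judge function; none of these names come from the paper.

```python
# Minimal sketch of rubric-level meta-evaluation, under assumed data fields:
# a judge labels each (instruction, response, rubric) triple as satisfied or
# not, and accuracy is measured against human labels. All names here are
# illustrative assumptions, not the benchmark's actual interface.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Instance:
    instruction: str
    response: str
    rubric: str          # one fine-grained criterion, e.g. "the reply is in French"
    human_label: bool    # ground-truth judgment from human annotators

def rubric_level_accuracy(
    instances: list[Instance],
    judge_fn: Callable[[str, str, str], bool],  # e.g. an LLM judge, ideally reasoning before its verdict
) -> float:
    """Fraction of per-rubric judgments that agree with human labels."""
    correct = sum(
        judge_fn(x.instruction, x.response, x.rubric) == x.human_label
        for x in instances
    )
    return correct / len(instances)

# Toy usage with a trivial stand-in judge (a real setup would query an LLM):
data = [Instance("Reply in French.", "Bonjour !", "The response is in French.", True)]
print(rubric_level_accuracy(data, lambda instruction, response, rubric: True))
```

A checklist-level variant would instead ask the judge for one verdict over the whole set of criteria at once, which is precisely the coarser paradigm the experiments compare against.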