🤖 AI Summary
Existing LLM judge evaluation benchmarks inadequately assess judges’ ability to discern factual accuracy and logical correctness in knowledge, reasoning, mathematics, and programming tasks.
Method: We introduce the first LLM judge benchmark oriented toward objective correctness. An automated pipeline converts diverse, challenging source datasets (including MMLU, GSM8K, and HumanEval) into response pairs with preference labels, using ground-truth factual and logical correctness as the sole, verifiable evaluation criterion and eliminating reliance on human preferences.
Contribution/Results: Our framework enables rigorous evaluation of strong judge models (e.g., GPT-4o) across mainstream paradigms: prompt engineering, fine-tuning, multi-agent systems, and reward modeling. Experiments show that state-of-the-art judge models achieve only about 55% accuracy, barely above random guessing on paired comparisons and substantially below human performance, demonstrating the benchmark's difficulty and its ability to expose critical limitations of current judge capabilities.
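The pairing-and-labeling idea behind the pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions: all function and class names here are hypothetical, the correctness check is a placeholder, and the actual JudgeBench pipeline differs in detail (e.g., it verifies answers programmatically, such as running unit tests for coding tasks).

```python
# Hypothetical sketch of an objective-correctness labeling pipeline.
# Preference labels come from ground truth, not human taste.
from dataclasses import dataclass

@dataclass
class LabeledPair:
    question: str
    chosen: str    # response consistent with ground truth
    rejected: str  # response contradicting ground truth

def is_correct(response: str, ground_truth: str) -> bool:
    # Placeholder check; a real pipeline would verify answers
    # programmatically (e.g., exact match for MMLU choices,
    # executing unit tests for HumanEval completions).
    return response.strip() == ground_truth.strip()

def build_pairs(question: str, responses: list[str],
                ground_truth: str) -> list[LabeledPair]:
    correct = [r for r in responses if is_correct(r, ground_truth)]
    incorrect = [r for r in responses if not is_correct(r, ground_truth)]
    # Pair each correct response with an incorrect one; the "chosen"
    # label is fully determined by objective correctness.
    return [LabeledPair(question, c, w) for c, w in zip(correct, incorrect)]

pairs = build_pairs("What is 2 + 2?", ["4", "5", "4"], "4")
```

A judge model is then scored on how often it prefers `chosen` over `rejected`, which is what makes the evaluation criterion verifiable rather than preference-based.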
📝 Abstract
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.