🤖 AI Summary
Existing LLM judge evaluation benchmarks inadequately assess judges’ ability to discern factual accuracy and logical correctness in knowledge, reasoning, mathematics, and programming tasks.
Method: We introduce the first LLM judge benchmark oriented toward objective correctness. An automated pipeline converts diverse, challenging source datasets (including MMLU, GSM8K, and HumanEval) into response pairs with preference labels, using ground-truth factual and logical correctness as the sole, verifiable evaluation criterion and eliminating reliance on human preferences.
Contribution/Results: Our framework enables rigorous evaluation of strong judge models (e.g., GPT-4o) across mainstream paradigms: prompt engineering, fine-tuning, multi-agent systems, and reward modeling. Experiments show that state-of-the-art judge models achieve only about 55% accuracy, barely above random guessing on paired comparisons and substantially below human performance, demonstrating the benchmark's difficulty and its ability to expose critical limitations of current judge capabilities.
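The pairing-and-labeling idea behind the pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions: all function and class names here are hypothetical, the correctness check is a placeholder, and the actual JudgeBench pipeline differs in detail (e.g., it verifies answers programmatically, such as running unit tests for coding tasks).

```python
# Hypothetical sketch of an objective-correctness labeling pipeline.
# Preference labels come from ground truth, not human taste.
from dataclasses import dataclass

@dataclass
class LabeledPair:
    question: str
    chosen: str    # response consistent with ground truth
    rejected: str  # response contradicting ground truth

def is_correct(response: str, ground_truth: str) -> bool:
    # Placeholder check; a real pipeline would verify answers
    # programmatically (e.g., exact match for MMLU choices,
    # executing unit tests for HumanEval completions).
    return response.strip() == ground_truth.strip()

def build_pairs(question: str, responses: list[str],
                ground_truth: str) -> list[LabeledPair]:
    correct = [r for r in responses if is_correct(r, ground_truth)]
    incorrect = [r for r in responses if not is_correct(r, ground_truth)]
    # Pair each correct response with an incorrect one; the "chosen"
    # label is fully determined by objective correctness.
    return [LabeledPair(question, c, w) for c, w in zip(correct, incorrect)]

pairs = build_pairs("What is 2 + 2?", ["4", "5", "4"], "4")
```

A judge model is then scored on how often it prefers `chosen` over `rejected`, which is what makes the evaluation criterion verifiable rather than preference-based.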
📝 Abstract
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.