Do Large Language Model Benchmarks Test Reliability?

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM benchmarks prioritize capability over reliability, and pervasive label noise obscures persistent model failures. Method: This work introduces the “platinum benchmark” concept, i.e., benchmarks carefully curated to minimize label errors and ambiguity, and, as a first attempt at constructing such benchmarks, revises examples from fifteen popular existing benchmarks. Contribution/Results: Evaluation of GPT-4, Claude, Llama, and other state-of-the-art models reveals persistent, patterned failures even on simple tasks such as elementary-level math word problems, and further analysis surfaces previously unidentified types of problems on which frontier models consistently struggle. By providing both a rigorous curation methodology and a high-quality, reliability-focused benchmark resource, the work advances the principled assessment of LLM trustworthiness beyond raw accuracy.
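
To make the curation idea concrete, here is a minimal sketch of how such a flagging pass could look. Everything in it (the Example layout, the reviewer_flags field, and the 0.8 agreement threshold) is an illustrative assumption, not the paper's actual pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    label: str
    model_answers: list[str]  # answers from several strong models
    reviewer_flags: list[str] = field(default_factory=list)  # e.g. "ambiguous wording"

def needs_revision(ex: Example, min_agreement: float = 0.8) -> bool:
    """Flag an example for re-annotation when a human reviewer raised a
    concern, or when model consensus disagrees with the recorded label."""
    if ex.reviewer_flags:
        return True
    if not ex.model_answers:
        return True  # nothing corroborates the label; send to review
    agree = sum(a == ex.label for a in ex.model_answers)
    return agree / len(ex.model_answers) < min_agreement
```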

📝 Abstract
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks.
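
Because labels on a platinum benchmark are verified, evaluation can report the raw failures themselves rather than an accuracy score alone. Below is a minimal sketch under assumptions introduced here: a hypothetical list-of-dicts format with question and label fields. The released repository's actual interface may differ.

```python
from typing import Callable

def platinum_failures(answer: Callable[[str], str],
                      examples: list[dict]) -> list[dict]:
    """Return every verified example the model gets wrong.

    On a platinum benchmark, label errors have been curated away, so each
    residual mistake reflects a genuine model failure worth inspecting
    individually instead of averaging into an accuracy number.
    """
    return [ex for ex in examples
            if answer(ex["question"]).strip() != ex["label"]]

# Usage with a toy "model" and the hypothetical data format above:
examples = [{"question": "What is 7 * 8?", "label": "56"}]
failures = platinum_failures(lambda q: "56", examples)
print(f"{len(failures)} failures out of {len(examples)} verified examples")
```
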
Problem

Research questions and friction points this paper is trying to address.

Assess how well current benchmarks measure LLM reliability.
Identify pervasive label errors in existing benchmarks.
Propose platinum benchmarks that minimize label errors and ambiguity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Platinum benchmarks curated to minimize label errors and ambiguity
Revision of fifteen popular benchmarks for reliable evaluation
Analysis reveals consistent failure patterns in frontier LLMs