Do Large Language Model Benchmarks Test Reliability?

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM benchmarks prioritize capability over reliability, and pervasive label noise obscures persistent model failures. Method: This work introduces the “platinum benchmark” concept, i.e., benchmarks carefully curated to minimize label errors and ambiguity, and, as a first attempt at constructing such benchmarks, revises examples from fifteen popular existing benchmarks. Contribution/Results: Evaluation of GPT-4, Claude, Llama, and other state-of-the-art models reveals persistent, patterned failures even on simple tasks such as elementary-level math word problems, and further analysis surfaces previously unidentified types of problems on which frontier models consistently struggle. By providing both a rigorous curation methodology and a high-quality, reliability-focused benchmark resource, the work advances the principled assessment of LLM trustworthiness beyond raw accuracy.
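
To make the curation idea concrete, here is a minimal sketch of how such a flagging pass could look. Everything in it (the Example layout, the reviewer_flags field, and the 0.8 agreement threshold) is an illustrative assumption, not the paper's actual pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    label: str
    model_answers: list[str]  # answers from several strong models
    reviewer_flags: list[str] = field(default_factory=list)  # e.g. "ambiguous wording"

def needs_revision(ex: Example, min_agreement: float = 0.8) -> bool:
    """Flag an example for re-annotation when a human reviewer raised a
    concern, or when model consensus disagrees with the recorded label."""
    if ex.reviewer_flags:
        return True
    if not ex.model_answers:
        return True  # nothing corroborates the label; send to review
    agree = sum(a == ex.label for a in ex.model_answers)
    return agree / len(ex.model_answers) < min_agreement
```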

📝 Abstract
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks.
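
Because labels on a platinum benchmark are verified, evaluation can report the raw failures themselves rather than an accuracy score alone. Below is a minimal sketch under assumptions introduced here: a hypothetical list-of-dicts format with question and label fields. The released repository's actual interface may differ.

```python
from typing import Callable

def platinum_failures(answer: Callable[[str], str],
                      examples: list[dict]) -> list[dict]:
    """Return every verified example the model gets wrong.

    On a platinum benchmark, label errors have been curated away, so each
    residual mistake reflects a genuine model failure worth inspecting
    individually instead of averaging into an accuracy number.
    """
    return [ex for ex in examples
            if answer(ex["question"]).strip() != ex["label"]]

# Usage with a toy "model" and the hypothetical data format above:
examples = [{"question": "What is 7 * 8?", "label": "56"}]
failures = platinum_failures(lambda q: "56", examples)
print(f"{len(failures)} failures out of {len(examples)} verified examples")
```
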
Problem

Research questions and friction points this paper is trying to address.

Assess how well current benchmarks measure LLM reliability.
Identify pervasive label errors in existing benchmarks.
Propose platinum benchmarks that minimize label errors and ambiguity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Platinum benchmarks curated to minimize label errors and ambiguity
Revision of fifteen popular benchmarks for reliable evaluation
Analysis reveals consistent failure patterns in frontier LLMs