Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) evaluation benchmarks suffer from declining credibility due to score inflation and selective reporting. To address this, the authors propose the Benchmark Health Index (BHI), the first data-driven framework to quantitatively assess benchmark “health” at a macro level. BHI constructs a multidimensional metric system grounded in three orthogonal dimensions—discriminative power, saturation resistance, and influence—leveraging performance data from 91 representative models across 106 validated benchmarks. This approach systematically maps the LLM evaluation ecosystem, identifies high-health benchmarks, and provides principled foundations and actionable guidance for benchmark selection, dynamic maintenance, and the design of future evaluation protocols.

📝 Abstract
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
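The abstract does not specify how the three axes are computed or combined, so as an intuition aid only, the sketch below shows one plausible toy reading: score spread across models as a proxy for Capability Discrimination, distance from the score ceiling as a proxy for Anti-Saturation, normalized adoption count as a proxy for Impact, and an equal-weight average as the composite. All function names, formulas, weights, and numbers here are illustrative assumptions, not the paper's actual method.

```python
# Toy sketch of a composite "benchmark health" score in the spirit of BHI's
# three axes. The paper's real formulas are not given in the abstract;
# every proxy and weight below is an assumption for illustration.
from statistics import pstdev


def capability_discrimination(scores):
    """Toy proxy: spread of model scores (population std dev, 0-100 scale)."""
    return pstdev(scores)


def anti_saturation(scores, ceiling=100.0):
    """Toy proxy: remaining headroom before the best model hits the ceiling."""
    return (ceiling - max(scores)) / ceiling


def impact(adoption_count, max_adoption):
    """Toy proxy: adoption breadth normalized to [0, 1]."""
    return adoption_count / max_adoption


def benchmark_health(scores, adoption_count, max_adoption,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Equal-weight combination of the three axes (weights are an assumption)."""
    axes = (
        capability_discrimination(scores) / 100.0,  # rescale std dev to [0, 1]
        anti_saturation(scores),
        impact(adoption_count, max_adoption),
    )
    return sum(w * a for w, a in zip(weights, axes))


# Example: one benchmark's scores for a handful of models (each of the paper's
# 91 models would contribute one score), adopted by 60 of 91 model reports.
scores = [42.0, 55.5, 61.2, 78.9, 83.4]
print(round(benchmark_health(scores, adoption_count=60, max_adoption=91), 3))
```

Under this toy reading, a benchmark scores high when models disagree sharply, the ceiling is still far away, and adoption is broad; a saturated benchmark (every model near 100) collapses the first two axes toward zero regardless of its popularity.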
Problem

Research questions and friction points this paper is trying to address.

benchmark reliability
score inflation
evaluation trustworthiness
LLM evaluation
benchmark degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark Health Index
Capability Discrimination
Anti-Saturation
Impact
LLM evaluation
Longyuan Zhu — Alibaba Group, Skylenage
Hairan Hua — Alibaba Group, Skylenage
Linlin Miao — Alibaba Group
Bing Zhao — SRI International
Natural Language Processing · Machine Learning · Optimizations