Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) evaluation benchmarks suffer from declining credibility due to score inflation and selective reporting. To address this, the authors propose the Benchmark Health Index (BHI), the first data-driven framework to quantitatively assess benchmark “health” at a macro level. BHI constructs a multidimensional metric system grounded in three orthogonal dimensions—discriminative power, saturation resistance, and influence—leveraging performance data from 91 representative models across 106 validated benchmarks. This approach systematically maps the LLM evaluation ecosystem, identifies high-health benchmarks, and provides principled foundations and actionable guidance for benchmark selection, dynamic maintenance, and the design of future evaluation protocols.

📝 Abstract
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical reports of 91 representative models in 2025, we systematically characterize the evaluation landscape. BHI is the first framework to quantify benchmark health at a macro level, providing a principled basis for benchmark selection and enabling dynamic lifecycle management for next-generation evaluation protocols.
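The abstract does not specify how the three axes are computed or combined, so as an intuition aid only, the sketch below shows one plausible toy reading: score spread across models as a proxy for Capability Discrimination, distance from the score ceiling as a proxy for Anti-Saturation, normalized adoption count as a proxy for Impact, and an equal-weight average as the composite. All function names, formulas, weights, and numbers here are illustrative assumptions, not the paper's actual method.

```python
# Toy sketch of a composite "benchmark health" score in the spirit of BHI's
# three axes. The paper's real formulas are not given in the abstract;
# every proxy and weight below is an assumption for illustration.
from statistics import pstdev


def capability_discrimination(scores):
    """Toy proxy: spread of model scores (population std dev, 0-100 scale)."""
    return pstdev(scores)


def anti_saturation(scores, ceiling=100.0):
    """Toy proxy: remaining headroom before the best model hits the ceiling."""
    return (ceiling - max(scores)) / ceiling


def impact(adoption_count, max_adoption):
    """Toy proxy: adoption breadth normalized to [0, 1]."""
    return adoption_count / max_adoption


def benchmark_health(scores, adoption_count, max_adoption,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Equal-weight combination of the three axes (weights are an assumption)."""
    axes = (
        capability_discrimination(scores) / 100.0,  # rescale std dev to [0, 1]
        anti_saturation(scores),
        impact(adoption_count, max_adoption),
    )
    return sum(w * a for w, a in zip(weights, axes))


# Example: one benchmark's scores for a handful of models (each of the paper's
# 91 models would contribute one score), adopted by 60 of 91 model reports.
scores = [42.0, 55.5, 61.2, 78.9, 83.4]
print(round(benchmark_health(scores, adoption_count=60, max_adoption=91), 3))
```

Under this toy reading, a benchmark scores high when models disagree sharply, the ceiling is still far away, and adoption is broad; a saturated benchmark (every model near 100) collapses the first two axes toward zero regardless of its popularity.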
Problem

Research questions and friction points this paper is trying to address.

benchmark reliability
score inflation
evaluation trustworthiness
LLM evaluation
benchmark degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark Health Index
Capability Discrimination
Anti-Saturation
Impact
LLM evaluation
Longyuan Zhu — Alibaba Group, Skylenage
Hairan Hua — Alibaba Group, Skylenage
Linlin Miao — Alibaba Group
Bing Zhao — SRI International
Natural Language Processing · Machine Learning · Optimizations