The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current benchmark evaluations suffer from an “averaging bias”: overall accuracy can be disproportionately influenced by a few subdomains, obscuring how heterogeneous a model’s performance actually is. Method: The authors propose *harmony*, a metric that quantifies how uniformly a model’s performance is distributed across a benchmark’s subdomains, and map each benchmark onto a mean-variance plane of harmony computed across models. This distributional perspective enables a systematic assessment of benchmark reliability. Contribution/Results: Analyzing 19 multiple-choice benchmarks and five model families, the study finds that widely used benchmarks, including ARC-Easy, are skewed by subdomain imbalance, so overall accuracy can mislead. It recommends reporting harmony alongside accuracy, reframing evaluation from scalar averages toward a more robust, distribution-aware measurement of performance.
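The summary does not spell out how harmony is computed, so the sketch below is only one plausible instantiation: a uniformity score taken as the normalized Shannon entropy of a model’s per-subdomain accuracies. The function name, the entropy choice, and the toy numbers are all hypothetical, not the paper’s definition.

```python
import numpy as np

def harmony(subdomain_accuracies, eps=1e-12):
    """Toy harmony score: normalized Shannon entropy of the per-subdomain
    accuracy distribution. Returns ~1.0 when performance is uniform across
    subdomains and drops toward 0 as performance concentrates in a few
    subdomains. (Illustrative only; the paper's definition may differ.)"""
    acc = np.asarray(subdomain_accuracies, dtype=float)
    p = acc / (acc.sum() + eps)               # normalize accuracies into a distribution
    entropy = -np.sum(p * np.log(p + eps))    # Shannon entropy of that distribution
    return float(entropy / np.log(len(acc)))  # divide by the maximum entropy, log K

# Uniform competence vs. performance carried by one dominant subdomain:
print(harmony([0.80, 0.80, 0.80, 0.80]))  # ~1.00
print(harmony([0.95, 0.20, 0.15, 0.10]))  # ~0.70
```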

📝 Abstract
Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
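The abstract’s mean-variance plane can be sketched as follows: compute harmony per model on each benchmark, then summarize the benchmark by the mean and variance of those scores across models. This reuses the hypothetical entropy-based harmony from above; the results dictionary, benchmark and model names, and accuracies are invented for illustration.

```python
import numpy as np

def harmony(accs, eps=1e-12):
    # Same illustrative entropy-based uniformity score as the earlier sketch.
    p = np.asarray(accs, dtype=float) / (np.sum(accs) + eps)
    return float(-np.sum(p * np.log(p + eps)) / np.log(len(p)))

# Invented per-benchmark, per-model subdomain accuracies.
results = {
    "ARC-Easy":    {"model_a": [0.92, 0.55, 0.60], "model_b": [0.88, 0.50, 0.58]},
    "Benchmark-X": {"model_a": [0.71, 0.70, 0.69], "model_b": [0.66, 0.68, 0.65]},
}

def mean_variance_plane(results):
    """Summarize each benchmark as (mean, variance) of harmony across models.
    Per the abstract, high mean and low variance signal more reliable evaluation."""
    plane = {}
    for bench, per_model in results.items():
        scores = [harmony(a) for a in per_model.values()]
        plane[bench] = (float(np.mean(scores)), float(np.var(scores)))
    return plane

print(mean_variance_plane(results))
```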
Problem

Research questions and friction points this paper is trying to address.

Measuring uniform performance distribution across benchmark subdomains
Identifying misleading aggregate accuracy from skewed subdomain influence
Proposing a harmony metric to complement, not replace, simple performance averages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Measuring performance uniformity across benchmark subdomains
Mapping benchmarks on mean-variance harmony plane
Recommending harmony reporting alongside accuracy metrics