🤖 AI Summary
Current LLM evaluation benchmarks suffer from excessive scale, low efficiency, and the absence of a unified framework that integrates redundancy elimination with performance prediction. To address this, we propose EssenceBench, a coarse-to-fine benchmark compression framework that jointly optimizes sample-level redundancy reduction and score reconstruction. It integrates three core components: (1) sample-level redundancy analysis, (2) attribution-driven sample selection, and (3) fitness-guided genetic algorithm optimization. EssenceBench is the first to systematically unify redundancy elimination with faithful score reconstruction, enabling high-fidelity model evaluation at extremely low data budgets. On HellaSwag, it preserves the full model ranking (ranking shift under 5%) using only 4% of the original samples, and maintains 95% ranking consistency with just 0.5%, substantially outperforming existing methods. Its key contribution is a generalizable, lightweight evaluation paradigm that simultaneously ensures accuracy, ranking consistency, and computational efficiency.
📝 Abstract
As the demand for comprehensive evaluation of diverse model capabilities steadily increases, benchmark suites have grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency remains largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify many highly similar samples that can be eliminated. In addition, we frame benchmark compression as an optimization problem aimed at score reconstruction. Building on these findings, we propose EssenceBench, a coarse-to-fine framework built on an iterative Genetic Algorithm (GA) that combines fitness-based subset search with attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method preserves the ranking of all models within a 5% shift using 25x fewer samples, and achieves 95% ranking preservation within a 5% shift using only 200x fewer samples.
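The fitness-based subset search described above can be illustrated with a minimal sketch: a genetic algorithm evolves fixed-size subsets of benchmark samples, scoring each subset by how closely the per-model scores it induces reconstruct the full-benchmark scores. Everything below is an illustrative assumption, not the actual EssenceBench implementation: the correctness matrix is synthetic, the population sizes and mutation scheme are arbitrary, and the attribution-based sample search is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: correctness matrix for M models on N benchmark samples.
# In a real pipeline this would come from actual model predictions.
M, N, K = 20, 1000, 40                      # models, samples, compressed subset size
acc = rng.random((M, N)) < rng.random(N)    # (model x sample) boolean correctness
full_scores = acc.mean(axis=1)              # ground-truth benchmark score per model

def fitness(subset):
    """Negative reconstruction error: mean |subset score - full score| over models."""
    sub_scores = acc[:, subset].mean(axis=1)
    return -np.abs(sub_scores - full_scores).mean()

# Simple GA over fixed-size subsets (elitist selection + single-swap mutation).
pop = [rng.choice(N, size=K, replace=False) for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]
    children = []
    for parent in survivors * 2:
        child = parent.copy()
        # Mutate: swap one selected sample for a currently unselected one.
        out_idx = rng.integers(K)
        candidates = np.setdiff1d(np.arange(N), child)
        child[out_idx] = rng.choice(candidates)
        children.append(child)
    pop = survivors + children

best = max(pop, key=fitness)
print(f"reconstruction error with {K}/{N} samples: {-fitness(best):.4f}")
```

Even this bare-bones search drives the score-reconstruction error well below that of a random subset of the same size, which is the core intuition behind framing benchmark compression as an optimization problem.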