EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and redundancy of large language model (LLM) evaluation, this paper proposes a training-free, efficient evaluation method. The approach introduces a "capability coverage maximization" principle that leverages the Model Utility Index (MUI) to adaptively select a representative subset of test samples that jointly maximize diversity and informativeness. This supports unbiased performance estimation and enables cross-dataset and cross-model-family transfer without large-scale annotation or fine-tuning. Extensive experiments across multiple mainstream benchmarks and LLMs show that the method achieves highly consistent model rankings (Kendall's τ > 0.92) using only 5–10% of the full test set, substantially improving evaluation efficiency while preserving reliability and fidelity.

📝 Abstract
The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval, a training-free approach for efficient benchmarking that effectively addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, by ensuring comprehensive coverage of model capabilities; fairness, by remaining independent of model performance during sample selection to avoid bias; and generalizability, by enabling flexible transfer across datasets and model families without reliance on large-scale evaluation data. Unlike traditional methods that rely on absolute performance or require extensive evaluation data, our approach adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Extensive experiments on multiple public benchmarks and diverse LLMs demonstrate that EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data. Furthermore, our method is flexible and scalable in size, allowing users to balance evaluation efficiency and representativeness according to specific needs. Overall, EffiEval provides a practical and generalizable solution for reliable, fair, and efficient evaluation in the era of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses computational challenges in large language model evaluation
Reduces data redundancy while maintaining evaluation reliability
Provides flexible transfer across datasets and model families
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free efficient benchmarking via capability coverage
Adaptive subset selection using Model Utility Index
Flexible transfer across datasets and models
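The adaptive subset selection described above can be sketched as a greedy coverage-maximization loop. This is an illustrative assumption, not the paper's implementation: here each sample is represented by a set of "capability" IDs it exercises (in EffiEval these would come from the MUI), and samples are picked to maximize marginal coverage.

```python
def select_subset(sample_capabilities, budget):
    """Greedily pick samples that maximize marginal capability coverage.

    sample_capabilities: dict mapping sample id -> set of capability ids
    budget: maximum number of samples to select
    Returns (selected sample ids, set of covered capability ids).
    """
    covered = set()
    selected = []
    remaining = dict(sample_capabilities)
    for _ in range(min(budget, len(remaining))):
        # Choose the sample that adds the most not-yet-covered capabilities.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not (remaining[best] - covered):
            break  # every remaining sample is redundant; stop early
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered
```

The budget parameter is what makes the method "flexible and scalable in size": a larger budget trades efficiency for coverage, and the loop stops early once additional samples are redundant.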
👥 Authors
Yaoning Wang
Institute of Trustworthy Embodied AI, Fudan University
Jiahao Ying
Singapore Management University
Yixin Cao
Institute of Trustworthy Embodied AI, Fudan University
Yubo Ma
Nanyang Technological University
Yugang Jiang
Institute of Trustworthy Embodied AI, Fudan University