Reliable and Efficient Amortized Model-based Evaluation

📅 2025-03-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high cost and poor difficulty control of frequent, multidimensional, safety-sensitive evaluation of large language models (LLMs), this paper proposes a difficulty-aware adaptive evaluation paradigm. Unlike unreliable random sampling or classical Item Response Theory (IRT), which incurs prohibitive calibration costs, the approach couples a content-based item difficulty predictor with a conditional, difficulty-controllable question generator, enabling precise difficulty modeling and dynamic adjustment within the IRT framework. Empirical evaluation across 22 mainstream NLP benchmarks and 172 LLMs shows that the method substantially improves assessment reliability over random subset selection, reduces computational overhead by orders of magnitude, and supports real-time, dynamic, and interpretable LLM evaluation. To the authors' knowledge, this is the first work to jointly achieve high reliability, validity, and efficiency in large-scale LLM assessment.

๐Ÿ“ Abstract
Comprehensive evaluations of language models (LMs) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnosis) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by carefully controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve evaluation efficiency by training a question generator conditioned on a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimate of LM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient than current common practice.
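The IRT and adaptive-testing machinery the abstract refers to can be sketched in a few lines. The following is an illustrative two-parameter logistic (2PL) sketch, not the paper's implementation: the function names, the coarse grid-search ability estimator, and the Fisher-information item selector are assumptions chosen to make the idea concrete.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher information an item contributes at ability theta; it peaks
    when the item's difficulty b matches theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_ability(responses, items):
    """Maximum-likelihood ability estimate over a coarse grid.
    responses: list of 0/1 outcomes; items: list of (a, b) pairs."""
    grid = [i / 10.0 for i in range(-40, 41)]  # theta in [-4, 4]
    def log_lik(theta):
        ll = 0.0
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p) if y else math.log(1.0 - p)
        return ll
    return max(grid, key=log_lik)

def pick_next_item(theta, pool, asked):
    """Adaptive testing: of the items not yet asked, choose the one that
    is most informative at the current ability estimate theta."""
    candidates = [i for i in range(len(pool)) if i not in asked]
    return max(candidates, key=lambda i: fisher_info(theta, *pool[i]))
```

In this sketch the evaluation loop alternates `estimate_ability` and `pick_next_item`, which is why controlling difficulty matters: an item whose difficulty is far from the model's ability contributes little information, so a random subset wastes budget on uninformative questions.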
Problem

Research questions and friction points this paper is trying to address.

High cost of comprehensive language model evaluations.
Unreliable performance measurement due to uncontrolled question difficulty in benchmark subsets.
Need for efficient and reliable adaptive testing methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts question difficulty from content.
Trains question generator for adaptive testing.
Improves evaluation reliability and efficiency.
🔎 Similar Papers
No similar papers found.