🤖 AI Summary
Language model (LM) benchmarking suffers from high computational cost, inaccurate capability estimation, annotation noise, and score saturation. To address these issues, the paper proposes Fluid Benchmarking, which brings Item Response Theory (IRT) and adaptive testing principles from psychometrics into LM evaluation. The approach maps performance into a latent ability space, using item difficulty and discrimination parameters estimated from existing evaluation results, and dynamically selects the most informative items based on the model's current ability estimate. This moves beyond static benchmarks, improving evaluation efficiency, validity, and robustness. On MMLU and related benchmarks, Fluid Benchmarking achieves higher validity and lower score variance with fifty times fewer items, outperforming both random sampling and existing IRT-based baselines, and enables scalable, low-overhead, adaptive LM assessment.
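To make the IRT component concrete: a common choice in this setting is the two-parameter logistic (2PL) model, which gives the probability that a model with latent ability θ answers an item correctly, given the item's discrimination *a* and difficulty *b*. This is a minimal sketch of that model, not the paper's implementation; the function name and parameterization are illustrative.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) item response model.

    theta: latent ability of the language model
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability level at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

For example, a model whose ability exactly matches an item's difficulty (θ = b) has a 50% chance of answering it correctly, and the probability rises toward 1 as θ exceeds b.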
📝 Abstract
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level, suggesting that evaluation should adapt to each LM. Methodologically, Fluid Benchmarking estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education. In our experiments, we compare Fluid Benchmarking against the common practice of random item sampling as well as more sophisticated baselines, including alternative methods grounded in item response theory. We examine four dimensions -- efficiency, validity, variance, and saturation -- and find that Fluid Benchmarking achieves superior performance in all of them (e.g., higher validity and less variance on MMLU with fifty times fewer items). Our analysis shows that the two components of Fluid Benchmarking have distinct effects: item response theory, used to map performance into a latent ability space, increases validity, while dynamic item selection reduces variance. Overall, our results suggest that LM benchmarking can be substantially improved by moving beyond static evaluation.
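The dynamic item selection described above can be sketched as a standard computerized-adaptive-testing loop: estimate the model's ability from its responses so far, then administer the unanswered item that is most informative at that ability level (maximal Fisher information under the 2PL model). This is a rough sketch under those assumptions, not the paper's code; the grid-search ability estimator and all function names are illustrative.

```python
import math

def p_correct(theta, a, b):
    # 2PL item response function
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    # Fisher information of a 2PL item at ability theta:
    # I(theta) = a^2 * P * (1 - P), largest when theta is near b
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    # Maximum-likelihood ability estimate by grid search over [-4, 4].
    # responses: list of (a, b, correct) triples for administered items.
    grid = [i / 10 for i in range(-40, 41)]
    def loglik(theta):
        ll = 0.0
        for a, b, y in responses:
            p = p_correct(theta, a, b)
            ll += math.log(p) if y else math.log(1.0 - p)
        return ll
    return max(grid, key=loglik)

def select_next_item(theta, items, administered):
    # Pick the unadministered item with maximal Fisher information
    # at the current ability estimate.
    # items: list of (a, b) pairs; administered: set of indices.
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: fisher_info(theta, *items[i]))
```

In use, the loop alternates `estimate_theta` and `select_next_item` until an item budget is exhausted, so easy items are skipped for strong models and hard items for weak ones; in the actual method, the (a, b) parameters would be fit from existing LM evaluation results rather than assumed.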