AI Summary
Existing active testing methods struggle to efficiently evaluate generative language models due to their high cost and inapplicability to non-classification tasks. This work extends active testing to generative settings for the first time, introducing a semantic entropy-based stratified sampling strategy. The approach partitions model outputs into strata using semantic entropy and employs proxy model signals to approximate Neyman allocation, enabling accurate performance estimation with a small set of representative samples. Experimental results demonstrate that the proposed method reduces mean squared error by up to 28% compared to uniform sampling and achieves an average annotation budget saving of 22.9%, while closely approaching the performance of the Oracle-Neyman benchmark.
Abstract
Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
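The pipeline described above (stratify by semantic entropy, allocate the labeling budget in a Neyman-style way, then combine per-stratum means) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that the surrogate's sampled answers have already been grouped into semantic clusters (`cluster_ids`), that `proxy_scores` is a hypothetical per-item signal from the surrogate standing in for the unknown target-model score, and that strata are equal-size bins by entropy rank. All function names are illustrative.

```python
import math
import random
from collections import Counter
from statistics import pstdev


def semantic_entropy(cluster_ids):
    """Entropy of the empirical distribution over semantic clusters
    of the answers sampled for one prompt (cluster ids are assumed given)."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def stratify(entropies, num_strata):
    """Split item indices into roughly equal-size strata by entropy rank
    (an assumed binning rule, chosen for simplicity)."""
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    size = math.ceil(len(order) / num_strata)
    return [order[k:k + size] for k in range(0, len(order), size)]


def neyman_allocation(strata, proxy_scores, budget):
    """Approximate Neyman allocation: n_h proportional to N_h * sigma_h,
    with sigma_h estimated from the proxy scores instead of true labels.
    Each stratum gets at least one sample, so the total can slightly
    exceed the budget after rounding."""
    weights = [len(s) * pstdev([proxy_scores[i] for i in s]) for s in strata]
    total = sum(weights)
    if total == 0:  # no proxy variance anywhere: fall back to proportional
        weights = [len(s) for s in strata]
        total = sum(weights)
    return [max(1, min(len(s), round(budget * w / total)))
            for s, w in zip(strata, weights)]


def stratified_estimate(strata, alloc, label_fn, rng):
    """Weighted sum of per-stratum sample means; label_fn(i) stands in
    for the expensive annotation of item i."""
    pool_size = sum(len(s) for s in strata)
    estimate = 0.0
    for stratum, n_h in zip(strata, alloc):
        sample = rng.sample(stratum, n_h)
        mean_h = sum(label_fn(i) for i in sample) / n_h
        estimate += len(stratum) / pool_size * mean_h
    return estimate
```

A typical call sequence would compute `semantic_entropy` per prompt, pass the resulting list to `stratify`, derive `alloc` with `neyman_allocation`, and then call `stratified_estimate` with a seeded `random.Random` so the sampled subset is reproducible. Strata where the proxy signal varies more receive more of the budget, which is the intuition behind Neyman allocation.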