Query-efficient model evaluation using cached responses

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the cost of black-box model evaluation, which traditionally requires generating and scoring a response for every benchmark query even when responses from previously evaluated models are already cached. The authors build on the Data Kernel Perspective Space (DKPS), a method for quantifying relationships between models in the black-box setting, to predict a new model's benchmark performance from those cached responses. They also propose an offline method that selects the set of queries maximizing goodness-of-fit on the reference models. Experimentally, DKPS-based methods reach the same mean absolute error as baselines with a substantially smaller query budget, and the selected query sets improve prediction accuracy over random query selection.
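The prediction step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the model-to-model geometry is obtained by classical multidimensional scaling on pairwise distances between embedded responses, and that benchmark scores are then regressed linearly on the resulting coordinates. The function names `dkps_embed` and `predict_score`, the array layout, and the choice of a linear regressor are all hypothetical.

```python
import numpy as np

def dkps_embed(responses, n_dims=2):
    """Map each model to a low-dimensional point (illustrative only).

    responses: array of shape (n_models, n_queries, embed_dim) holding an
    embedded response per model and query (assumed layout).
    """
    n_models, n_queries = responses.shape[0], responses.shape[1]
    flat = responses.reshape(n_models, -1)
    # Pairwise distance between models over all shared queries.
    diff = flat[:, None, :] - flat[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) / np.sqrt(n_queries)
    # Classical MDS: double-center the squared distances, then eigendecompose.
    J = np.eye(n_models) - np.ones((n_models, n_models)) / n_models
    B = -0.5 * J @ (dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

def predict_score(ref_responses, ref_scores, new_responses, n_dims=2):
    """Predict the new model's benchmark score from cached reference data.

    ref_responses: (n_ref, n_queries, embed_dim) cached responses.
    ref_scores:    (n_ref,) known benchmark scores of the reference models.
    new_responses: (n_queries, embed_dim) responses of the new model on the
                   same queries (only these need to be generated).
    """
    all_resp = np.concatenate([ref_responses, new_responses[None]], axis=0)
    coords = dkps_embed(all_resp, n_dims)
    X = np.hstack([coords, np.ones((coords.shape[0], 1))])  # add intercept
    # Fit on reference models only, then evaluate at the new model's point.
    beta, *_ = np.linalg.lstsq(X[:-1], ref_scores, rcond=None)
    return float(X[-1] @ beta)
```

In this sketch the new model is queried only on the columns of `ref_responses` that are passed in, so a smaller query set directly means fewer generations for the new model.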

📝 Abstract

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously evaluated models are often cached, creating an opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.
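The offline query selection mentioned at the end of the abstract could be approached in several ways; the abstract does not say which search procedure is used. The sketch below assumes a greedy forward search that, at each step, adds the query giving the highest in-sample R² when reference-model scores are regressed on coordinates from classical multidimensional scaling. The greedy strategy, the R² criterion, and the names `mds_coords`, `fit_r2`, and `greedy_select` are illustrative assumptions, not the paper's method.

```python
import numpy as np

def mds_coords(responses, n_dims=1):
    """Classical MDS coordinates for models (assumed geometry step)."""
    n = responses.shape[0]
    flat = responses.reshape(n, -1)
    sq_dist = np.sum((flat[:, None] - flat[None]) ** 2, axis=-1)
    J = np.eye(n) - 1.0 / n  # centering matrix I - (1/n) * ones
    vals, vecs = np.linalg.eigh(-0.5 * J @ sq_dist @ J)
    idx = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

def fit_r2(responses, scores, n_dims=1):
    """In-sample R^2 of a linear fit from model coordinates to scores."""
    coords = mds_coords(responses, n_dims)
    X = np.hstack([coords, np.ones((len(scores), 1))])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    resid = scores - X @ beta
    total = scores - scores.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def greedy_select(responses, scores, budget, n_dims=1):
    """Greedily pick `budget` query indices using cached data only.

    responses: (n_ref, n_queries, embed_dim) cached reference responses.
    scores:    (n_ref,) benchmark scores of the reference models.
    """
    chosen = []
    remaining = list(range(responses.shape[1]))
    for _ in range(budget):
        # Try each unused query and keep the one with the best fit.
        best_q = max(
            remaining,
            key=lambda q: fit_r2(responses[:, chosen + [q]], scores, n_dims),
        )
        chosen.append(best_q)
        remaining.remove(best_q)
    return chosen
```

Because the search touches only cached reference responses and known reference scores, it can run before the new model is queried at all; the new model is then evaluated only on the returned indices.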

Problem

Research questions and friction points this paper is trying to address.

query-efficient evaluation

cached responses

model evaluation

benchmark performance

black-box setting

Innovation

Methods, ideas, or system contributions that make the work stand out.

query-efficient evaluation

cached responses

Data Kernel Perspective Space