🤖 AI Summary
The rapid advancement of large language models (LLMs) outpaces the availability of up-to-date, human-annotated evaluation data, hindering timely and reliable assessment of model capabilities.
Method: We propose a task-elicited adaptive evaluation framework grounded in scaffolded evaluator agents. It performs behavioral-space search and dynamic task generation over domain-specific corpora to automatically discover high-difficulty, highly discriminative failure cases. The method supports cross-model transfer of challenging instances and incorporates human verification to ensure validity.
Contribution/Results: Applied to legal reasoning, predictive modeling, and online harassment detection, the framework uncovers systematic consistency failures in state-of-the-art LLMs. The resulting benchmark generalizes well across models with diverse capability profiles, enabling high-quality, sustainable, domain-specific evaluation: a novel paradigm for robust LLM assessment.
📄 Abstract
Manual curation of evaluation datasets is struggling to keep up with the rapidly expanding capabilities and deployment scenarios of language models. Towards scalable model profiling, we introduce and validate a framework for evaluating LLMs, called Adaptive Evaluations. Adaptive evaluations use scaffolded language models (evaluator agents) to search through a target model's behavior on a domain dataset and create difficult questions (tasks) that can discover and probe the model's failure modes. We find that frontier models lack consistency when adaptively probed with our framework on a diverse suite of datasets and tasks, including but not limited to legal reasoning, forecasting, and online harassment. Generated questions pass human validity checks and often transfer to other models with different capability profiles, demonstrating that adaptive evaluations can also be used to create difficult domain-specific datasets.
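The search-and-generate loop described in the abstract can be sketched in a few lines. The snippet below is a minimal toy illustration, not the paper's implementation: `toy_target_model`, `paraphrase`, `mutate`, and `adaptive_search` are all hypothetical stand-ins. The evaluator repeatedly mutates seed tasks from a domain corpus, flags candidates on which the target model is inconsistent (its answer flips under a meaning-preserving paraphrase), and reuses those failures as seeds for the next round.

```python
def toy_target_model(task: str) -> str:
    """Stand-in for the target LLM (hypothetical). It is consistent on short
    tasks but order-sensitive on longer ones -- a planted failure mode."""
    words = task.split()
    if len(words) <= 3:
        return "yes"
    # Buggy heuristic: the answer depends on surface word order, not content.
    return "yes" if words[0] < words[-1] else "no"

def paraphrase(task: str) -> str:
    """Toy meaning-preserving rewrite: reverse the word order."""
    return " ".join(reversed(task.split()))

def mutate(task: str) -> str:
    """Toy task generator: append a distractor clause to raise difficulty."""
    return task + " under new conditions"

def is_inconsistent(model, task: str) -> bool:
    """Flag a task if the model's answer flips under paraphrase."""
    return model(task) != model(paraphrase(task))

def adaptive_search(model, seed_tasks, mutate_fn, rounds=3):
    """Behavioral-space search: mutate seeds, keep candidates that expose
    inconsistency, and seed the next round with those failures."""
    pool, failures = list(seed_tasks), []
    for _ in range(rounds):
        candidates = [mutate_fn(t) for t in pool]
        hits = [t for t in candidates if is_inconsistent(model, t)]
        failures.extend(hits)
        pool = hits or pool  # fall back to the old pool if no hits this round
    return failures
```

In the real framework the mutation and inconsistency checks are performed by scaffolded evaluator agents over domain corpora, and discovered failures are human-verified before entering the benchmark; this sketch only shows the control flow of the search.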