Adaptively evaluating models with task elicitation

πŸ“… 2025-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
The rapid advancement of large language models (LLMs) outpaces the availability of up-to-date, human-annotated evaluation data, hindering timely and reliable assessment of model capabilities. Method: We propose an adaptive evaluation framework built on scaffolded evaluator agents. It searches a target model's behavior over domain-specific corpora and dynamically generates tasks, automatically discovering difficult, highly discriminative failure cases. The framework supports cross-model transfer of challenging instances and incorporates human verification to ensure validity. Contribution/Results: Applied to legal reasoning, forecasting, and online harassment detection, the framework uncovers systematic consistency failures in state-of-the-art LLMs. The generated questions transfer well to models with diverse capability profiles, enabling high-quality, sustainable, domain-specific evaluation: a new paradigm for robust LLM assessment.

πŸ“ Abstract
Manual curation of evaluation datasets is struggling to keep up with the rapidly expanding capabilities and deployment scenarios of language models. Towards scalable model profiling, we introduce and validate a framework for evaluating LLMs, called Adaptive Evaluations. Adaptive evaluations use scaffolded language models (evaluator agents) to search through a target model's behavior on a domain dataset and create difficult questions (tasks) that can discover and probe the model's failure modes. We find that frontier models lack consistency when adaptively probed with our framework on a diverse suite of datasets and tasks, including but not limited to legal reasoning, forecasting, and online harassment. Generated questions pass human validity checks and often transfer to other models with different capability profiles, demonstrating that adaptive evaluations can also be used to create difficult domain-specific datasets.
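The loop the abstract describes can be pictured as follows. This is a minimal illustrative sketch, not the paper's implementation: `target`, `propose`, and `is_failure` are hypothetical stand-ins for the target model, the scaffolded evaluator agent, and the consistency check, respectively.

```python
def adaptive_evaluation(target, propose, is_failure, seed_corpus, rounds=3):
    """Sketch of an adaptive evaluation loop.

    target(task) -> model answer
    propose(failed_task, discovered) -> list of harder follow-up tasks
        (the evaluator agent mutating a known failure)
    is_failure(task, answer) -> bool (e.g., a consistency check)
    """
    discovered = []            # failure-inducing tasks found so far
    frontier = list(seed_corpus)
    for _ in range(rounds):
        # Probe the target on the current frontier of candidate tasks.
        failures = [t for t in frontier if is_failure(t, target(t))]
        discovered.extend(failures)
        # The evaluator agent expands each failure into harder variants.
        frontier = [new for f in failures for new in propose(f, discovered)]
        if not frontier:       # search exhausted: no failures to build on
            break
    return discovered
```

In the paper's setting the discovered tasks additionally pass a human validity check and are re-run on other models to test transfer; this sketch only shows the search-and-generate core.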
Problem

Research questions and friction points this paper is trying to address.

Scalable evaluation of rapidly evolving language models
Identifying model failure modes through adaptive probing
Creating domain-specific datasets for diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Evaluations framework for scalable model profiling
Scaffolded language models create domain-specific difficult tasks
Generated questions transfer across models, ensuring broad applicability