🤖 AI Summary
The evaluation of large language models (LLMs) lags behind their rapid advancement: static benchmarks are quickly saturated, while dynamic benchmark construction is prohibitively costly. Method: This paper proposes BeTaL, the first framework to embed LLMs directly into the benchmark design loop. Leveraging environment design principles and a parameterized template space, BeTaL enables automated generation of dynamic benchmarks with controllable difficulty and tunable realism. An LLM-in-the-loop mechanism jointly optimizes benchmark attributes via reasoning-guided search. Contribution/Results: Experiments yield two novel benchmarks and an extension of τ-bench. Across multi-task, multi-difficulty settings, BeTaL achieves a mean difficulty deviation of only 5.3%-13.2%, reducing error by 2-4x over baselines and effectively alleviating the evaluation bottleneck.
📝 Abstract
The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark, τ-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2%, a 2-4x improvement over the baselines.
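The tuning loop the abstract describes (generate a benchmark from template parameters, measure its difficulty, let an LLM reason about the gap and propose new parameters) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the single `constraint_density` knob, the toy difficulty model, and the simple feedback rule standing in for the LLM call are all hypothetical.

```python
# Illustrative LLM-in-the-loop benchmark tuning loop. Everything here is a
# stand-in: the real system would generate benchmark instances, run models
# on them, and query an LLM to reason over a multi-parameter template space.

def evaluate_difficulty(params):
    """Stand-in for running models on a benchmark generated from `params`
    and measuring the observed failure rate. Here: a toy monotone map from
    a single 'constraint density' knob to difficulty in [0, 1]."""
    density = params["constraint_density"]
    return min(1.0, max(0.0, 0.2 + 0.8 * density))

def llm_propose(params, measured, target):
    """Stand-in for the LLM reasoning step: given the deviation between
    measured and target difficulty, propose updated template parameters."""
    step = 0.5 * (target - measured)  # nudge the knob toward the target
    new = dict(params)
    new["constraint_density"] = min(1.0, max(0.0,
                                    params["constraint_density"] + step))
    return new

def tune_benchmark(target, params=None, tol=0.02, max_iters=20):
    """Iterate generate -> evaluate -> propose until the measured
    difficulty is within `tol` of the target (or iterations run out)."""
    params = params or {"constraint_density": 0.5}
    measured = evaluate_difficulty(params)
    for _ in range(max_iters):
        if abs(measured - target) <= tol:
            break
        params = llm_propose(params, measured, target)
        measured = evaluate_difficulty(params)
    return params, measured
```

With these toy dynamics the deviation shrinks geometrically each round, converging on the target difficulty in a handful of iterations; the paper's contribution is making the analogous search work over real, multi-attribute benchmark templates.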