A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Current agent benchmarks are saturating, are costly to construct, and cover only a limited range of tool use. To address these limitations, this work proposes TASTE, a paradigm that reverses the usual construction process and generates tasks from tool sequences. TASTE combines an adaptive contrastive n-gram model, tool sequence clustering, and an iterative difficulty optimization mechanism to automatically construct benchmark tasks with high coverage and elevated complexity. The resulting τ^c-Bench substantially increases task difficulty across the three domains of τ^2-Bench, causing a significant performance drop in models that previously approached saturation on τ^2-Bench. τ^c-Bench also more than doubles the number of unique tool combinations, increasing both the diversity and the difficulty of agent evaluation.
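As a rough illustration of the coverage claim, the sketch below counts unique tool combinations over a set of tasks. It assumes a "combination" is the unordered set of tools a task requires, which the summary does not specify, and the tool names and task lists are hypothetical toy data rather than anything taken from τ^2-Bench or τ^c-Bench.

```python
from typing import Iterable, Sequence


def unique_tool_combinations(tasks: Iterable[Sequence[str]]) -> set[frozenset[str]]:
    """Return the distinct tool sets used across tasks.

    Assumption: order and repetition of tool calls are ignored, so two tasks
    that call the same tools in a different order count as one combination.
    """
    return {frozenset(seq) for seq in tasks}


# Hypothetical tool sequences, one per task.
baseline_tasks = [
    ["get_user", "get_order"],
    ["get_order", "get_user"],  # same set as the first task
    ["cancel_order"],
]
generated_tasks = baseline_tasks + [
    ["get_user", "cancel_order"],
    ["get_user", "get_order", "refund"],
]

n_baseline = len(unique_tool_combinations(baseline_tasks))
n_generated = len(unique_tool_combinations(generated_tasks))
print(n_baseline, n_generated)
```

Counting ordered sequences instead (using `tuple(seq)` in place of `frozenset(seq)`) would give a stricter notion of coverage; which definition the paper uses is not stated here.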

📝 Abstract

As agent capabilities advance, existing benchmarks, such as $τ^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE uses an Adaptive Contrastive $n$-gram model trained on LLM-judged validity signals, which enables sampling valid tool sequences that cover a wide range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating $τ^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
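The abstract does not give the form of the Adaptive Contrastive $n$-gram model, so the following is only a minimal sketch of one way such a sampler could work, not the authors' method. It assumes a bigram ($n = 2$) model, a score equal to the smoothed log-probability under sequences judged valid minus `alpha` times the log-probability under sequences judged invalid, and a feedback loop in which each sampled sequence is judged and fed back into the counts. The class `ContrastiveBigram`, the parameter `alpha`, the tool names, and `toy_judge` (a stand-in for the LLM judge) are all hypothetical.

```python
import math
import random
from collections import defaultdict

START, END = "<s>", "</s>"


class ContrastiveBigram:
    """Toy contrastive bigram model over tool names (illustrative only)."""

    def __init__(self, tools, alpha=1.0, smoothing=1.0):
        self.vocab = list(tools) + [END]
        self.alpha = alpha          # weight of the "invalid" model (assumed)
        self.smoothing = smoothing  # add-k smoothing constant
        self.pos = defaultdict(lambda: defaultdict(int))  # counts from valid sequences
        self.neg = defaultdict(lambda: defaultdict(int))  # counts from invalid sequences

    def update(self, seq, valid):
        """Add one judged sequence to the valid or invalid bigram counts."""
        table = self.pos if valid else self.neg
        path = [START, *seq, END]
        for prev, nxt in zip(path, path[1:]):
            table[prev][nxt] += 1

    def _log_prob(self, table, prev, nxt):
        row = table[prev]
        total = sum(row.values()) + self.smoothing * len(self.vocab)
        return math.log((row[nxt] + self.smoothing) / total)

    def weights(self, prev):
        """Unnormalized sampling weights for the next tool after `prev`."""
        scores = [
            self._log_prob(self.pos, prev, t) - self.alpha * self._log_prob(self.neg, prev, t)
            for t in self.vocab
        ]
        top = max(scores)
        return [math.exp(s - top) for s in scores]

    def sample(self, rng, max_len=6):
        """Sample a tool sequence until END is drawn or max_len is reached."""
        seq, prev = [], START
        while len(seq) < max_len:
            nxt = rng.choices(self.vocab, weights=self.weights(prev))[0]
            if nxt == END:
                break
            seq.append(nxt)
            prev = nxt
        return seq


def toy_judge(seq):
    """Hypothetical stand-in for the LLM judge: a sequence must start with find_user."""
    return bool(seq) and seq[0] == "find_user"


tools = ["find_user", "get_order", "cancel_order"]
model = ContrastiveBigram(tools)
model.update(["find_user", "get_order"], valid=True)
model.update(["get_order", "find_user"], valid=False)

# Adaptive loop: sample, judge, and feed the judgment back into the counts.
rng = random.Random(0)
for _ in range(50):
    candidate = model.sample(rng)
    model.update(candidate, valid=toy_judge(candidate))

start_weights = dict(zip(model.vocab, model.weights(START)))
```

After the loop, `start_weights` favors `find_user` as the first tool because only sequences starting with it are ever judged valid. The clustering and difficulty-evolution stages described in the abstract are not covered by this sketch.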

Problem

Research questions and friction points this paper is trying to address.

agent benchmarks

task coverage

benchmark difficulty

tool-use patterns

benchmark saturation

Innovation

Methods, ideas, or system contributions that make the work stand out.

TASTE

tool-use coverage

benchmark synthesis