AgentSearchBench: A Benchmark for AI Agent Search in the Wild

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing approaches struggle to retrieve task-appropriate agents from real-world AI agent ecosystems, where capabilities are compositional and execution-dependent. This work proposes AgentSearchBench, a large-scale benchmark for agent search in the wild built from nearly 10,000 real-world agents. It formalizes agent search as retrieval and reranking problems and assesses relevance with execution-grounded performance signals. Experiments show that retrieval and reranking based on semantic descriptions alone have significant limitations, whereas lightweight behavioral signals such as execution-aware probing substantially improve ranking quality.

📝 Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
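The retrieve-then-rerank setup described in the abstract can be illustrated with a toy sketch. This is not the benchmark's implementation: the bag-of-words cosine retriever, the linear mixing weight `alpha`, the agent names, the probe success rates, and the relevance labels are all hypothetical stand-ins chosen to show how a description-based ranking can diverge from execution-grounded relevance, and how a behavioral signal might correct it.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query: str, agents: dict, k: int) -> list:
    """Stage 1: rank agents by description similarity only; keep the top k."""
    scored = [(cosine(query, desc), name) for name, desc in agents.items()]
    return sorted(scored, reverse=True)[:k]

def rerank(candidates: list, probe_success: dict, alpha: float = 0.5) -> list:
    """Stage 2: mix similarity with a probe success rate (assumed linear mix)."""
    mixed = [
        (alpha * sim + (1 - alpha) * probe_success.get(name, 0.0), name)
        for sim, name in candidates
    ]
    return [name for _, name in sorted(mixed, reverse=True)]

def ndcg(ranking: list, relevance: dict, k: int) -> float:
    """nDCG@k against graded relevance labels."""
    dcg = sum(
        relevance.get(name, 0.0) / math.log2(i + 2)
        for i, name in enumerate(ranking[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Hypothetical toy data: agent_b has the closest description, but agent_a
# actually performs the task better when executed.
QUERY = "summarize quarterly financial reports"
AGENTS = {
    "agent_a": "summarize financial reports and extract key figures",
    "agent_b": "quarterly financial reports summarize assistant",
    "agent_c": "general purpose writing helper",
}
PROBE_SUCCESS = {"agent_a": 0.9, "agent_b": 0.1, "agent_c": 0.0}  # assumed probe outcomes
RELEVANCE = {"agent_a": 1.0, "agent_b": 0.2, "agent_c": 0.0}      # assumed execution-grounded labels

baseline = [name for _, name in retrieve(QUERY, AGENTS, k=3)]
reranked = rerank(retrieve(QUERY, AGENTS, k=3), PROBE_SUCCESS)
print("description-only:", baseline, round(ndcg(baseline, RELEVANCE, 3), 3))
print("with probe signal:", reranked, round(ndcg(reranked, RELEVANCE, 3), 3))
```

In this toy setting the description-only ranking places `agent_b` first, while the probe-aware reranking promotes `agent_a` and scores higher under nDCG. A real system would replace the bag-of-words scorer with a proper retriever and derive the probe signal from actual agent executions.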
Problem

Research questions and friction points this paper is trying to address.

AI agent search
agent retrieval
execution-dependent capabilities
real-world agents
task-agent matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent search
execution-grounded evaluation
retrieval and reranking
behavioral probing
AI agent benchmark