AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

📅 2026-03-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the absence of a unified end-to-end benchmark for large language model (LLM) agent recommendation tailored to narrative queries, where existing evaluation approaches are fragmented and ill-suited for conditional recommendations. We propose AgentSelect, the first benchmark that formulates agent selection as a capability-profile-based recommendation task from narrative queries to agents. It integrates heterogeneous data from over 40 sources to construct a large-scale, unified positive dataset comprising more than 110K queries, 100K agents, and 250K interactions. By leveraging capability profiling, heterogeneous data alignment, synthetic compositional interaction generation, and content-aware matching, AgentSelect enables a unified representation of LLM-only, toolkit-only, and hybrid agents. Experiments demonstrate that our approach significantly outperforms conventional collaborative filtering under long-tailed distributions and consistently improves performance on previously unseen agent directories in the MuleRun marketplace.

πŸ“ Abstract
LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that the compositional interactions synthesized in Part III are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, establishing a reproducible foundation to study and accelerate the emerging agent ecosystem.
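The abstract contrasts popularity-based collaborative filtering with content-aware capability matching, which scores agents by how well their capability profile matches the narrative query rather than by interaction counts. As a rough illustration of the content-aware side only, here is a minimal bag-of-words cosine matcher over a hypothetical agent catalog; the agent names, profiles, and scoring function are illustrative assumptions, not the paper's actual method or data.

```python
import math
from collections import Counter

def capability_score(query: str, profile: str) -> float:
    """Cosine similarity between bag-of-words vectors of a narrative
    query and an agent's capability profile (toy content-aware matcher)."""
    q, p = Counter(query.lower().split()), Counter(profile.lower().split())
    dot = sum(q[t] * p[t] for t in q.keys() & p.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) \
         * math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0

# Hypothetical catalog: each agent is described by a capability profile.
agents = {
    "web-researcher": "search the web summarize articles cite sources",
    "code-assistant": "write python code debug refactor run tests",
    "data-analyst": "load csv data plot charts compute statistics",
}

query = "help me debug and refactor my python code"
ranked = sorted(agents, key=lambda a: capability_score(query, agents[a]),
                reverse=True)
print(ranked[0])  # highest-scoring agent for this query
```

Unlike collaborative filtering, this scorer needs no interaction history, which is why content-aware matching degrades more gracefully on long-tail or previously unseen agents; a real system would replace the bag-of-words vectors with learned embeddings.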
Problem

Research questions and friction points this paper addresses.

agent recommendation
narrative query
LLM agents
capability matching
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent recommendation
capability-aware matching
compositional agents
narrative query
benchmark synthesis