AppSelectBench: Application-Level Tool Selection Benchmark

📅 2025-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks focus narrowly on fine-grained API selection and thus fail to assess models’ cross-application reasoning and decision-making capabilities. Method: We introduce AppSelectBench—the first application-level tool selection benchmark for Computer-Using Agents (CUAs)—covering 100 widely used desktop applications and more than 100,000 realistic user tasks. It features a novel semantics-driven task generation pipeline and a unified evaluation protocol supporting diverse inference paradigms, including random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. Contribution/Results: Experiments reveal systematic deficiencies in mainstream large language models when performing cross-application tool selection. AppSelectBench establishes a reproducible, quantifiable, and large-scale evaluation foundation for application-level reasoning in CUAs, addressing a critical gap in the field.

📝 Abstract
Computer-Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection—deciding which application to use before invoking fine-grained tools such as APIs—is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source code is available at https://github.com/microsoft/appselectbench.
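The evaluation protocols above all reduce to the same measurement: given a user task, a predictor names one application, and the benchmark scores the fraction of tasks where that choice matches the gold label. A minimal sketch of such a metric, with a trivial keyword heuristic standing in for a model — the data format, function names, and rules here are illustrative assumptions, not AppSelectBench's actual API:

```python
from typing import Callable, Iterable, Tuple


def selection_accuracy(
    examples: Iterable[Tuple[str, str]],
    select_app: Callable[[str], str],
) -> float:
    """Fraction of (task, gold_app) pairs where the predicted app matches gold."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(select_app(task) == gold for task, gold in examples)
    return correct / len(examples)


def keyword_heuristic(task: str) -> str:
    """Toy heuristic baseline (illustrative only): route by keyword, else default."""
    rules = {"email": "Outlook", "spreadsheet": "Excel", "slides": "PowerPoint"}
    for keyword, app in rules.items():
        if keyword in task.lower():
            return app
    return "Browser"  # fallback application


tasks = [
    ("Send an email to the team about the release", "Outlook"),
    ("Build a spreadsheet of Q3 expenses", "Excel"),
    ("Look up tomorrow's weather forecast", "Browser"),
]
print(selection_accuracy(tasks, keyword_heuristic))  # → 1.0
```

Swapping `keyword_heuristic` for a random picker or an LLM call would give the random, zero-shot, or few-shot variants of the same protocol.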
Problem

Research questions and friction points this paper is trying to address.

Evaluates application selection capability in computer-using agents
Assesses whether models can reason across different applications for a given user task
Measures how consistently language models choose the correct application
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for application selection in computer-using agents
Pipeline generates realistic user tasks at scale
Evaluates inter-application reasoning across diverse settings