TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

📅 2025-11-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work evaluates the tool planning and scheduling capabilities of large language model (LLM) agents on compounding real-world tasks. It introduces TPS-Bench, a benchmark of 200 compounding tasks at two difficulty levels, each composed of multiple subtasks and built on a repository of hundreds of model context protocol (MCP) tools, with task completion rate and execution time as the primary metrics. The authors evaluate popular closed-source and open-source LLMs (e.g., GLM-4.5 and GPT-4o), compare how they trade off sequential against parallel tool calls, and run an initial reinforcement learning (RL) study on Qwen3-1.7B. GLM-4.5 achieves the highest completion rate (64.72%) but relies on extensive sequential tool calls and therefore runs much longer; GPT-4o prioritizes parallel tool calls yet attains only 45.08% completion; with merely 100 RL training samples, Qwen3-1.7B gains 6% in completion rate and cuts execution time by 14%. The results indicate that most models plan tools reasonably but differ in scheduling, and that RL on a small sample budget can improve scheduling efficiency without compromising completion.

📝 Abstract
Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, and calendar checking, and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM-4.5 achieves a leading task completion rate of 64.72% with extensive sequential tool calls, and hence suffers from significantly long execution time. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering that reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate based on merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
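To make the sequential-versus-parallel trade-off described in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical task whose subtask names, latencies, and dependencies are invented for illustration. It compares the execution time of a fully sequential schedule with one that starts each tool call as soon as its prerequisites finish.

```python
from functools import lru_cache

# Hypothetical subtasks: name -> (latency in seconds, prerequisite subtasks).
# Names, latencies, and dependencies are illustrative assumptions,
# not values taken from TPS-Bench.
SUBTASKS = {
    "web_search": (3.0, []),
    "calendar_check": (1.0, []),
    "map_navigation": (2.0, ["calendar_check"]),
    "write_summary": (1.5, ["web_search", "map_navigation"]),
}


def sequential_time(subtasks):
    """Execution time when every tool call waits for the previous one."""
    return sum(latency for latency, _ in subtasks.values())


def parallel_time(subtasks):
    """Execution time when each call starts once its prerequisites finish.

    This is the length of the critical path through the dependency graph.
    """
    @lru_cache(maxsize=None)
    def finish(name):
        latency, deps = subtasks[name]
        start = max((finish(dep) for dep in deps), default=0.0)
        return start + latency

    return max(finish(name) for name in subtasks)


if __name__ == "__main__":
    print("sequential:", sequential_time(SUBTASKS))  # 7.5
    print("parallel:  ", parallel_time(SUBTASKS))    # 4.5
```

In this toy graph, the two independent calls (`web_search` and `calendar_check`) overlap, so the parallel schedule is bounded by the longest dependency chain rather than by the sum of all latencies. A real agent must also decide which calls are truly independent, which is where scheduling mistakes can lower completion rate.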
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' ability to handle compounding tasks requiring multiple tools
Benchmarking tool planning and scheduling efficiency in heterogeneous tool repositories
Assessing task completion rates versus execution time optimization in AI agents
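The two quantities weighed against each other above can be sketched as follows. This is an illustrative assumption, not the benchmark's scoring code: the summary does not define the completion metric, so the sketch takes it to be the mean fraction of subtasks completed per task. `TaskRun` and all function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One agent run on one task (hypothetical record layout)."""
    subtasks_done: int
    subtasks_total: int
    seconds: float


def completion_rate(runs):
    """Mean per-task fraction of completed subtasks, as a percentage (assumed definition)."""
    return 100.0 * mean(r.subtasks_done / r.subtasks_total for r in runs)


def mean_execution_time(runs):
    """Average wall-clock time per task, in seconds."""
    return mean(r.seconds for r in runs)


def relative_change(before, after):
    """Percentage change from `before` to `after`; negative means a reduction."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    runs = [TaskRun(2, 4, 10.0), TaskRun(3, 3, 20.0)]
    print(completion_rate(runs))       # 75.0
    print(mean_execution_time(runs))   # 15.0
```

Reporting both numbers side by side matters because a policy can raise one at the expense of the other, as the GLM-4.5 and GPT-4o results in the abstract illustrate.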
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates tool planning and scheduling abilities
Uses reinforcement learning to optimize execution efficiency
Tests agents on compounding tasks with diverse tools