TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM agent evaluations predominantly focus on final-answer correctness, neglecting the fidelity of tool-usage trajectories, including tool selection, parameterization, and invocation ordering. Method: We propose TRAJECT-Bench, the first trajectory-aware benchmark, featuring a high-fidelity executable tool suite and trajectories synthesized from production-grade APIs that span diverse parallel breadth and dependency depth. Contribution/Results: TRAJECT-Bench introduces a fine-grained trajectory-level evaluation framework, enabling the first systematic diagnosis of critical bottlenecks such as tool confusion, parameter-blind selection, and long-trajectory transfer failure. Empirical analysis reveals scaling behavior across tool diversity and trajectory length, identifies dominant failure modes, and delivers reproducible optimization pathways. The framework supports multidimensional, interpretable, and fine-grained assessment of tool-calling capability.

📝 Abstract
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate LLMs' tool-use capability, they largely focus on final answers and overlook the detailed tool-usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool-use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Beyond final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection and argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as confusion between similar tools and parameter-blind selection, and scaling behavior with tool diversity and trajectory length, exposing a bottleneck in transitioning from short to mid-length trajectories and offering actionable guidance for LLMs' tool use.
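To make the trajectory-level diagnostics concrete, here is a minimal sketch of how tool selection, argument correctness, and order satisfaction could be scored against a gold trajectory. The metric definitions, function name, and data layout are assumptions for illustration only, not TRAJECT-Bench's actual implementation.

```python
def trajectory_diagnostics(gold, pred):
    """Compare a predicted tool-call trajectory against a gold one.

    Each trajectory is a list of (tool_name, args_dict) pairs.
    Returns tool-selection accuracy, argument correctness (over
    correctly selected positions), and order satisfaction (fraction
    of gold precedence pairs preserved in the prediction).
    """
    n = len(gold)
    # Position-wise tool selection: right tool at the right step
    tool_hits = sum(1 for (gt, _), (pt, _) in zip(gold, pred) if gt == pt)
    # Argument correctness, conditioned on correct tool selection
    arg_hits = sum(
        1 for (gt, ga), (pt, pa) in zip(gold, pred)
        if gt == pt and ga == pa
    )
    # Order satisfaction: how many gold "a before b" constraints survive
    pred_pos = {t: i for i, (t, _) in enumerate(pred)}
    pairs = [(gold[i][0], gold[j][0])
             for i in range(n) for j in range(i + 1, n)]
    kept = sum(
        1 for a, b in pairs
        if a in pred_pos and b in pred_pos and pred_pos[a] < pred_pos[b]
    )
    return {
        "tool_selection": tool_hits / n if n else 1.0,
        "arg_correctness": arg_hits / tool_hits if tool_hits else 0.0,
        "order_satisfaction": kept / len(pairs) if pairs else 1.0,
    }
```

A scorer like this separates "called the wrong tool" from "called the right tool with wrong arguments" from "called the right tools in the wrong order", which is exactly the kind of failure-mode attribution the benchmark reports alongside final accuracy.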
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' tool selection and parameterization accuracy
Assessing trajectory correctness in parallel and interdependent tool calls
Identifying failure modes in tool usage across diverse task scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes trajectory-aware benchmark for tool use
Evaluates tool selection, parameterization, and ordering
Reveals failure modes and scaling behavior bottlenecks