🤖 AI Summary
Existing travel-planning benchmarks cover a narrow range of domains and offer little support for multi-turn interaction, which prevents systematic evaluation of agents’ dynamic planning and tool orchestration. To address this, we introduce TravelBench, the first realistic multi-turn travel-planning benchmark, which features dynamic preference elicitation, multi-step reasoning, and constrained calls to external tools. We construct three subsets of real-user requests (multi-turn, single-turn, and unsolvable), design a controllable sandbox environment with ten deterministic-output tools, and integrate dialogue state tracking, constraint-aware response generation, and tool-call simulation. Experiments with mainstream LLMs on this real-user data reveal significant bottlenecks in iterative planning, tool coordination, and adaptation to hard constraints. TravelBench provides a reproducible, standardized evaluation platform for travel-planning agents.
📝 Abstract
Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in both domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and therefore fail to assess agent capabilities comprehensively. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools whose outputs are deterministic, so that agents reason over the same tool results on every run. We evaluate multiple LLMs on TravelBench and analyze their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.
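To illustrate what deterministic tool outputs can mean in practice, the following is a minimal sketch and not TravelBench's actual implementation. It assumes one plausible design: the sandbox freezes the first result returned for each (tool, arguments) pair and replays it on every later call. The class `DeterministicSandbox` and the tool `search_hotels` are hypothetical names introduced only for this example.

```python
import hashlib
import itertools
import json
from typing import Any, Callable, Dict


class DeterministicSandbox:
    """Hypothetical sandbox: replays one fixed output per (tool, arguments) pair."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Make a tool callable by name inside the sandbox."""
        self._tools[name] = fn

    @staticmethod
    def _key(name: str, args: Dict[str, Any]) -> str:
        # Canonical JSON (sorted keys) so argument order does not change the key.
        # Arguments are assumed to be JSON-serializable.
        payload = json.dumps({"tool": name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, name: str, **args: Any) -> Any:
        """Run the tool once per distinct input, then return the stored result."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        key = self._key(name, args)
        if key not in self._cache:
            self._cache[key] = self._tools[name](**args)
        return self._cache[key]


# Stand-in for a live backend whose answer drifts between calls (illustrative only).
_drift = itertools.count(100)


def search_hotels(city: str, max_price: int) -> Dict[str, Any]:
    return {"city": city, "cheapest": min(next(_drift), max_price)}


sandbox = DeterministicSandbox()
sandbox.register("search_hotels", search_hotels)

first = sandbox.call("search_hotels", city="Paris", max_price=150)
again = sandbox.call("search_hotels", max_price=150, city="Paris")
print(first == again)  # the second call is served from the frozen result
```

Under this assumption, two evaluation runs that issue the same tool calls observe identical results, so differences in scores can be attributed to the agent and not to a changing backend.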