TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning

📅 2025-12-27
🤖 AI Summary
Existing travel planning benchmarks suffer from limited domain coverage and inadequate support for multi-turn interaction, hindering systematic evaluation of agents' dynamic planning and tool orchestration capabilities. To address this, we introduce TravelBench, the first realistic multi-turn travel planning benchmark, featuring dynamic preference elicitation, multi-step reasoning, and constrained external tool invocation. We construct three subsets of real-user requests (multi-turn, single-turn, and unsolvable), design a controllable sandbox environment with ten deterministic-output tools, and integrate dialogue state tracking, constraint-aware response generation, and tool-call simulation. Evaluations of mainstream LLMs on real-user data reveal significant bottlenecks in iterative planning, tool coordination, and hard-constraint adaptation. TravelBench thus provides a reproducible, standardized evaluation platform for travel planning agents.

📝 Abstract
Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and therefore fail to comprehensively assess agent capabilities. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools, providing deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and conduct an analysis of their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.
Problem

Research questions and friction points this paper is trying to address.

Develops a benchmark for multi-turn travel planning with tools
Evaluates LLM agents on dynamic user interaction and tool use
Addresses limitations in domain coverage and interaction realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn interaction benchmark for travel planning
Controlled sandbox environment with deterministic tools
Real-world user requests for dynamic evaluation
Xiang Cheng
Gaoling School of Artificial Intelligence, Renmin University of China
Yulan Hu
AMAP, Alibaba Group
Xiangwen Zhang
AMAP, Alibaba Group
Lu Xu
Postdoc, Riken AIP
deep learning, machine learning, computer vision
Zheng Pan
AMAP, Alibaba Group
Xin Li
AMAP, Alibaba Group
Yong Liu
Gaoling School of Artificial Intelligence, Renmin University of China