TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning

📅 2025-12-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing travel planning benchmarks suffer from limited domain coverage and inadequate support for multi-turn interaction, which hinders systematic evaluation of agents’ dynamic planning and tool orchestration capabilities. To address this, we introduce TravelBench, the first realistic, multi-turn travel planning benchmark, featuring dynamic preference elicitation, multi-step reasoning, and constrained external tool invocation. We construct three subsets of real-user requests (multi-turn, single-turn, and unsolvable), design a controllable sandbox environment with ten deterministic-output tools, and integrate dialogue state tracking, constraint-aware response generation, and tool-call simulation. Evaluating mainstream LLMs on this real-user data reveals significant bottlenecks in iterative planning, tool coordination, and hard-constraint adaptation. TravelBench provides a reproducible, standardized evaluation platform for travel planning agents.

📝 Abstract

Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and therefore fail to comprehensively assess agent capabilities. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools, providing deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and conduct an analysis of their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.
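To make the idea of a sandbox with deterministic tool outputs concrete, here is a minimal Python sketch. It is not TravelBench's actual implementation: the class `DeterministicSandbox`, the helper `_canonical_key`, and the tool name `search_flights` are all hypothetical, and the caching strategy (freeze the first response for each distinct tool call) is only one plausible way to obtain repeatable outputs.

```python
import copy
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def _canonical_key(tool_name: str, args: Dict[str, Any]) -> str:
    """Stable key: the same tool with the same arguments always maps to the same key.

    Assumes all argument values are JSON-serializable.
    """
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DeterministicSandbox:
    """Hypothetical sandbox that freezes the first response of every distinct tool call."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._cache: Dict[str, Any] = {}
        self.call_log: List[Tuple[str, Dict[str, Any]]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Expose a backing function to the agent under the given tool name."""
        self._tools[name] = fn

    def call(self, name: str, **args: Any) -> Dict[str, Any]:
        """Invoke a tool; repeated identical calls return the cached result."""
        self.call_log.append((name, dict(args)))
        if name not in self._tools:
            return {"ok": False, "error": f"unknown tool: {name}"}
        key = _canonical_key(name, args)
        if key not in self._cache:
            self._cache[key] = self._tools[name](**args)
        # Return a copy so callers cannot mutate the frozen result.
        return {"ok": True, "result": copy.deepcopy(self._cache[key])}


if __name__ == "__main__":
    # Illustrative fixture data, not taken from the benchmark.
    def search_flights(origin: str, destination: str) -> List[Dict[str, Any]]:
        return [{"flight": "XX100", "origin": origin, "destination": destination}]

    sandbox = DeterministicSandbox()
    sandbox.register("search_flights", search_flights)
    print(sandbox.call("search_flights", origin="PEK", destination="SHA"))
```

The design choice illustrated here is that argument order does not affect the cache key, so an agent that issues the same query twice in different turns sees identical results, which is what makes scores comparable across runs.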

Problem

Research questions and friction points this paper is trying to address.

Develops a benchmark for multi-turn travel planning with tools

Evaluates LLM agents on dynamic user interaction and tool use

Addresses limitations in domain coverage and interaction realism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn interaction benchmark for travel planning

Controlled sandbox environment with deterministic tools

Real-world user requests for dynamic evaluation

🔎 Similar Papers

Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning