GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

📅 2026-05-24
🤖 AI Summary
This work addresses a limitation of existing evaluations of large language model (LLM) agents in travel planning: they predominantly focus on single-user scenarios and do not assess coordination under conflicting multi-user preferences. To bridge this gap, the work introduces GroupTravelBench, the first benchmark for multi-user, multi-turn conversational travel planning, comprising 650 synthetically generated tasks grounded in real user profiles, points of interest, and fare data. The benchmark formalizes three core capabilities (preference elicitation, user coordination, and fair, feasible planning) and features an interactive sandbox environment that enables offline evaluation. By integrating cached real-world tool data, multi-turn dialogue simulation, task difficulty stratification, and group utility optimization, the framework supports reproducible LLM agent assessment. Experimental results show that even state-of-the-art models have substantial shortcomings in capturing user preferences comprehensively and in ensuring group fairness.
📝 Abstract
Travel planning is a realistic task for evaluating the planning and tool-use abilities of LLM agents. However, existing benchmarks typically assume only a single user, thereby avoiding one of the most challenging aspects of real-world scenarios: an agent's ability to identify and resolve conflicts among multiple users. To address this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Based on real user profiles, POI data, and ticket price data, we synthesize 650 tasks and divide them into three difficulty levels. Beyond standard abilities in single-user itinerary planning, such as multi-step reasoning and tool use, our benchmark further evaluates three key capabilities required for travel agents: (i) elicitation: proactively engaging in multi-turn dialogue to gather preferences from each user; (ii) coordination: resolving conflicts among users through compromise or subgrouping strategies; and (iii) planning: searching for travel plans that maximize overall group utility while maintaining fairness and feasibility. To simulate real-world conversational itinerary planning while enabling reliable tool use and offline evaluation, we build an interactive sandbox environment with cached real-world tool data. We evaluate a wide range of LLMs and find that even frontier models still show substantial weaknesses in preference coverage and group fairness. GroupTravelBench provides a practical and reproducible benchmark for advancing research on LLM agents for real-world travel planning.
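To make the planning objective in (iii) concrete, the following is a minimal Python sketch of how a candidate plan could be scored for group utility, fairness, and feasibility. It is not the benchmark's actual metric: the abstract does not give formulas, so the `Plan` class, the per-user utilities in [0, 1], the budget-based feasibility check, the max-min gap used as a fairness proxy, and the `fairness_weight` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate itinerary (not a GroupTravelBench data type)."""
    name: str
    cost_per_person: float
    utilities: dict[str, float]  # assumed per-user satisfaction in [0, 1]


def is_feasible(plan: Plan, budgets: dict[str, float]) -> bool:
    # Assumed feasibility rule: every traveller can afford the plan.
    return all(plan.cost_per_person <= budgets[user] for user in plan.utilities)


def group_utility(plan: Plan) -> float:
    # Mean satisfaction across all travellers.
    return sum(plan.utilities.values()) / len(plan.utilities)


def fairness(plan: Plan) -> float:
    # Assumed fairness proxy: 1 minus the gap between the best-off
    # and worst-off traveller (1.0 means everyone is equally satisfied).
    values = list(plan.utilities.values())
    return 1.0 - (max(values) - min(values))


def score(plan: Plan, fairness_weight: float = 0.5) -> float:
    # Weighted blend of group utility and fairness.
    return (1.0 - fairness_weight) * group_utility(plan) + fairness_weight * fairness(plan)


def choose_plan(plans: list[Plan], budgets: dict[str, float],
                fairness_weight: float = 0.5) -> Optional[Plan]:
    # Drop infeasible plans, then return the highest-scoring one (or None).
    feasible = [p for p in plans if is_feasible(p, budgets)]
    if not feasible:
        return None
    return max(feasible, key=lambda p: score(p, fairness_weight))


# Made-up example data for three travellers.
budgets = {"alice": 500.0, "bob": 300.0, "chen": 400.0}
plans = [
    Plan("beach resort", 450.0, {"alice": 0.9, "bob": 0.9, "chen": 0.9}),
    Plan("museum tour", 250.0, {"alice": 1.0, "bob": 0.2, "chen": 0.9}),
    Plan("mixed city trip", 280.0, {"alice": 0.7, "bob": 0.6, "chen": 0.65}),
]

if __name__ == "__main__":
    best = choose_plan(plans, budgets)
    print(best.name if best else "no feasible plan")
```

With these made-up numbers, the beach resort is excluded because it exceeds one traveller's budget. Equal weighting selects the mixed city trip, whereas setting `fairness_weight=0.0` selects the museum tour, which has a higher average utility but leaves one traveller far less satisfied. This illustrates why a group planner cannot rely on average utility alone.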
Problem

Research questions and friction points this paper is trying to address.

multi-user travel planning
LLM agents
preference conflict resolution
group coordination
travel itinerary benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-user planning
LLM agent benchmarking
preference elicitation
conflict resolution
interactive sandbox