GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing evaluations of large language model (LLM) agents in travel planning: they focus predominantly on single-user scenarios and do not assess coordination under conflicting multi-user preferences. To bridge this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn conversational travel planning, comprising 650 synthetically generated tasks grounded in real user profiles, points of interest (POIs), and ticket price data. The benchmark formalizes three core capabilities (preference elicitation, user coordination, and fair, feasible planning) and provides an interactive sandbox environment that enables offline evaluation. By combining cached real-world tool data, multi-turn dialogue simulation, task difficulty stratification, and group utility optimization, our framework supports reproducible assessment of LLM agents. Experiments show that even state-of-the-art models have substantial weaknesses in preference coverage and group fairness.

📝 Abstract

Travel planning is a realistic task for evaluating the planning and tool-use abilities of LLM agents. However, existing benchmarks typically assume only a single user, thereby avoiding one of the most challenging aspects of real-world scenarios: an agent's ability to identify and resolve conflicts among multiple users. To address this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Based on real user profiles, POI data, and ticket price data, we synthesize 650 tasks and divide them into three difficulty levels. Beyond standard abilities in single-user itinerary planning, such as multi-step reasoning and tool use, our benchmark further evaluates three key capabilities required for travel agents: (i) elicitation: proactively engaging in multi-turn dialogue to gather preferences from each user; (ii) coordination: resolving conflicts among users through compromise or subgrouping strategies; and (iii) planning: searching for travel plans that maximize overall group utility while maintaining fairness and feasibility. To simulate real-world conversational itinerary planning while enabling reliable tool use and offline evaluation, we build an interactive sandbox environment with cached real-world tool data. We evaluate a wide range of LLMs and find that even frontier models still show substantial weaknesses in preference coverage and group fairness. GroupTravelBench provides a practical and reproducible benchmark for advancing research on LLM agents for real-world travel planning.
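
To make the planning capability concrete, the following is a minimal sketch of how a plan might be chosen to balance group utility, fairness, and feasibility. It is not the benchmark's scoring method, which the abstract does not specify. The `Plan` class, the budget-based feasibility check, the `fairness_weight` parameter, the mean-minus-spread score, and all sample data are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate itinerary (not a GroupTravelBench type)."""
    name: str
    cost_per_person: float
    # Assumed per-user satisfaction scores in [0, 1], keyed by user name.
    utilities: dict


def is_feasible(plan: Plan, budgets: dict) -> bool:
    """Assumed feasibility rule: every user can afford their share."""
    return all(plan.cost_per_person <= budget for budget in budgets.values())


def group_score(plan: Plan, fairness_weight: float = 0.5) -> float:
    """Mean utility minus a penalty on the gap between the best- and worst-off users."""
    values = list(plan.utilities.values())
    return mean(values) - fairness_weight * (max(values) - min(values))


def pick_plan(plans: list, budgets: dict) -> Optional[Plan]:
    """Return the feasible plan with the highest fairness-adjusted score, if any."""
    feasible = [p for p in plans if is_feasible(p, budgets)]
    return max(feasible, key=group_score, default=None)


# Illustrative data only: three users with different budgets and tastes.
budgets = {"ana": 300.0, "ben": 250.0, "chen": 400.0}
plans = [
    Plan("museum_day", 120.0, {"ana": 0.9, "ben": 0.2, "chen": 0.9}),
    Plan("mixed_day", 150.0, {"ana": 0.7, "ben": 0.6, "chen": 0.7}),
    Plan("luxury_tour", 350.0, {"ana": 1.0, "ben": 1.0, "chen": 1.0}),
]
best = pick_plan(plans, budgets)
print(best.name if best else "no feasible plan")
```

In this toy data, `museum_day` and `mixed_day` have the same mean utility, but the spread penalty favors `mixed_day` because no user is left far behind; `luxury_tour` is excluded because it exceeds two of the budgets. Other fairness notions, such as maximizing the minimum utility, would fit the same structure by swapping out `group_score`.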

Problem

Research questions and friction points this paper is trying to address.

multi-user travel planning

LLM agents

preference conflict resolution

group coordination

travel itinerary benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-user planning

LLM agent benchmarking

preference elicitation