AI Summary
Existing evaluations lack rigorous assessment of large language model (LLM) agents' ability to jointly perform multi-turn tool orchestration and user preference optimization for complex constrained planning tasks. Method: We construct a realistic tool ecosystem comprising verified transportation, accommodation, and ticketing databases covering 20 U.S. National Parks, integrated with a simulated commercial booking platform. We propose the first multi-turn constrained optimization and preference coordination benchmark tailored to travel planning, unifying dialogue modeling, tool invocation, hard-constraint satisfaction, and soft-preference optimization. Results: Experiments reveal that mainstream LLMs reliably satisfy hard constraints but exhibit significant limitations in soft-preference optimization (e.g., time-cost trade-offs) and cross-service coordinated planning; open-source models underperform further. Our framework quantifies, for the first time, the systematic gap between feasible solutions and Pareto-optimal solutions, establishing a novel benchmark for evaluating planning robustness in LLM agents.
Abstract
Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
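To make the constrained preference optimization framing concrete, the following is a minimal sketch, not the COMPASS implementation or its scoring metric. It assumes a hypothetical `Plan` record, a single budget-plus-lodging hard constraint, and an illustrative weighted cost/time preference score. Every name, weight, and data value is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate itinerary (not a COMPASS data structure)."""
    name: str
    cost: float      # total price in USD
    hours: float     # total travel time in hours
    has_hotel: bool  # whether lodging is included


def is_feasible(plan: Plan, budget: float) -> bool:
    # Hard constraints (assumed): stay within budget and include lodging.
    return plan.cost <= budget and plan.has_hotel


def preference_score(plan: Plan, cost_weight: float = 1.0,
                     time_weight: float = 20.0) -> float:
    # Soft preference (assumed): weighted sum of cost and time; lower is better.
    return cost_weight * plan.cost + time_weight * plan.hours


def best_plan(plans, budget):
    # Filter by the hard constraints first, then optimize the soft preference.
    feasible = [p for p in plans if is_feasible(p, budget)]
    return min(feasible, key=preference_score) if feasible else None


def acceptable_optimal_gap(chosen: Plan, plans, budget):
    # How much worse a feasible chosen plan scores than the best feasible plan.
    best = best_plan(plans, budget)
    if best is None or not is_feasible(chosen, budget):
        return None
    return preference_score(chosen) - preference_score(best)


# Illustrative data only.
plans = [
    Plan("red-eye + motel", cost=600, hours=16, has_hotel=True),
    Plan("direct + lodge", cost=750, hours=6, has_hotel=True),
    Plan("direct + resort", cost=1400, hours=5, has_hotel=True),  # over budget
    Plan("flight only", cost=400, hours=6, has_hotel=False),      # no lodging
]

best = best_plan(plans, budget=1000)
gap = acceptable_optimal_gap(plans[0], plans, budget=1000)
print(best.name, gap)
```

In this toy setup, an agent that returns the first feasible plan satisfies every hard constraint yet leaves preference value unrealized, which is the shape of the acceptable-optimal gap described in the abstract.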