COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

📅 2025-10-08

🤖 AI Summary

Problem: Existing evaluations lack rigorous assessment of large language model (LLM) agents’ ability to jointly perform multi-turn tool orchestration and user preference optimization for complex constrained planning tasks.

Method: We construct a realistic tool ecosystem comprising verified transportation, accommodation, and ticketing databases covering 20 U.S. National Parks, integrated with a simulated commercial booking platform. We propose the first multi-turn constrained optimization and preference coordination benchmark tailored to travel planning, unifying dialogue modeling, tool invocation, hard-constraint satisfaction, and soft-preference optimization.

Results: Experiments reveal that mainstream LLMs reliably satisfy hard constraints but exhibit significant limitations in soft-preference optimization (e.g., time–cost trade-offs) and cross-service coordinated planning; open-source models underperform further. Our framework quantifies, for the first time, the systematic gap between feasible solutions and Pareto-optimal solutions, establishing a novel benchmark for evaluating planning robustness in LLM agents.
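To make the distinction between feasible and Pareto-optimal solutions concrete, here is a minimal Python sketch for a time–cost trade-off. The `Itinerary` type, its fields, and the sample values are hypothetical illustrations, not data or code from COMPASS; the sketch only shows the standard definition of Pareto dominance over two objectives that are both minimized.

```python
from typing import List, NamedTuple


class Itinerary(NamedTuple):
    """Hypothetical candidate plan; both fields are objectives to minimize."""
    name: str
    hours: float  # total travel time
    cost: float   # total price in USD


def dominates(a: Itinerary, b: Itinerary) -> bool:
    """a dominates b if a is no worse on both objectives and strictly better on one."""
    no_worse = a.hours <= b.hours and a.cost <= b.cost
    strictly_better = a.hours < b.hours or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(options: List[Itinerary]) -> List[Itinerary]:
    """Keep every option that no other option dominates (input order preserved)."""
    return [o for o in options if not any(dominates(other, o) for other in options)]


# Made-up candidates: all four could be feasible, but only some are Pareto-optimal.
options = [
    Itinerary("A", 5.0, 420.0),
    Itinerary("B", 7.5, 310.0),
    Itinerary("C", 8.0, 450.0),  # slower and pricier than A, so dominated
    Itinerary("D", 6.0, 380.0),
]
front_names = [o.name for o in pareto_front(options)]
```

Under these assumed values, an agent that returns itinerary C has produced a feasible plan that is not Pareto-optimal, which is the kind of gap the summary refers to.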

📝 Abstract

Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
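The constrained preference optimization framing can be sketched in a few lines of Python: hard constraints filter the candidates, and a soft preference ranks whatever survives. Everything here is an assumption made for illustration (the `Hotel` record, the budget and pet-friendly constraints, rating as the soft preference, and the `evaluate` helper); it is not the benchmark's actual schema or scoring procedure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hotel:
    """Hypothetical accommodation record."""
    name: str
    price_per_night: float
    pet_friendly: bool
    rating: float


def is_feasible(h: Hotel, budget: float, needs_pet_friendly: bool) -> bool:
    """Hard constraints: within budget, and pet-friendly when the user requires it."""
    return h.price_per_night <= budget and (h.pet_friendly or not needs_pet_friendly)


def preference_score(h: Hotel) -> float:
    """Soft preference (assumed): a higher rating is better."""
    return h.rating


def evaluate(chosen: Hotel, catalog: List[Hotel], budget: float,
             needs_pet_friendly: bool) -> Tuple[bool, bool]:
    """Return (acceptable, optimal) for the agent's chosen hotel."""
    feasible = [h for h in catalog if is_feasible(h, budget, needs_pet_friendly)]
    acceptable = chosen in feasible
    if not acceptable:
        return False, False
    best = max(feasible, key=preference_score)
    return True, preference_score(chosen) == preference_score(best)


# Made-up catalog for a user with a 200 USD budget who travels with a pet.
catalog = [
    Hotel("Lodge A", 180.0, True, 4.1),
    Hotel("Lodge B", 150.0, True, 4.6),
    Hotel("Lodge C", 120.0, False, 4.8),  # violates the pet-friendly constraint
    Hotel("Lodge D", 260.0, True, 4.9),   # violates the budget constraint
]
result_a = evaluate(catalog[0], catalog, budget=200.0, needs_pet_friendly=True)
result_b = evaluate(catalog[1], catalog, budget=200.0, needs_pet_friendly=True)
result_c = evaluate(catalog[2], catalog, budget=200.0, needs_pet_friendly=True)
```

In this toy setup, choosing Lodge A is acceptable but not optimal, which mirrors the acceptable-optimal gap described in the abstract: the constraints are met, yet a better-rated feasible option (Lodge B) exists.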

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' multi-turn tool use for complex planning tasks

Optimizing user preferences while satisfying hard constraints in travel planning

Identifying performance gaps in preference optimization and multi-service coordination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn tool-mediated planning benchmark

Constrained preference optimization problem formulation

Realistic travel database and tool ecosystem
