🤖 AI Summary
This work addresses two limitations of existing agentic search evaluation: it often neglects efficiency, and it rarely covers queries with underspecified user preferences, both of which hinder real-world deployment. The authors construct HotelQuEST, a benchmark of 214 hotel search queries ranging from simple factual requests to complex queries, and make annotators' implicit preferences explicit by collecting clarifications. They introduce a cost-performance-aware evaluation framework that jointly accounts for query complexity, preference ambiguity, and system efficiency. Experiments show that large language model (LLM) agents achieve higher accuracy than traditional retrievers, but at substantially higher cost due to redundant tool calls and routing that fails to match query complexity to model capability, which leaves substantial room for cost-aware optimization.
📝 Abstract
Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that remains largely underexplored in current agentic search evaluation. As a result, many agentic search systems remain impractical despite their impressive performance. In this work, we introduce HotelQuEST, a benchmark comprising 214 hotel search queries that range from simple factual requests to complex queries, enabling evaluation across the full spectrum of query difficulty. We further address the challenge of evaluating underspecified user preferences by collecting clarifications that make annotators' implicit preferences explicit for evaluation. We find that LLM-based agents achieve higher accuracy than traditional retrievers, but at substantially higher costs due to redundant tool calls and suboptimal routing that fails to match query complexity to model capability. Our analysis exposes inefficiencies in current agentic search systems and demonstrates substantial potential for cost-aware optimization.