InteractComp: Evaluating Search Agents With Ambiguous Queries

📅 2025-10-28

🤖 AI Summary

Problem: In real-world search, user queries are often ambiguous or incomplete and need interactive clarification; however, existing search agents lack this capability, and no suitable evaluation benchmark exists. Method: We introduce InteractComp, a benchmark of 210 expert-crafted ambiguous questions across nine domains, built with a first-of-its-kind “target–distractor” methodology that produces realistic, verifiable disambiguation tasks that can only be solved through interaction. Contribution/Results: InteractComp reveals severe overconfidence in mainstream models: the best-performing of 17 models achieves only 13.73% accuracy in the interactive setting, far below its 71.50% accuracy with full context. Forcing interaction significantly improves performance, which indicates that current prompting strategies fail to activate the models' latent interactive capability. This work provides the first systematic evaluation of search agents' ability to recognize ambiguity and proactively seek clarification, and it offers a benchmark and empirical foundation for interactive search research.

📝 Abstract

Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
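
To make the headline gap concrete, the two accuracy figures quoted in the abstract can be compared directly. This is only an arithmetic sketch: the variable names are illustrative, and the numbers are the ones reported above for the best-performing model.

```python
# Figures quoted in the abstract for the best-performing model (percent accuracy).
interactive_accuracy = 13.73   # ambiguous query, interactive setting
full_context_accuracy = 71.50  # same questions with complete context supplied

# Absolute gap in percentage points and relative ratio between the two settings.
gap_points = round(full_context_accuracy - interactive_accuracy, 2)
ratio = round(full_context_accuracy / interactive_accuracy, 2)

print(f"gap: {gap_points} points, ratio: {ratio}x")
```

The gap is 57.77 percentage points, roughly a fivefold difference, which is the basis for attributing the failure to overconfidence rather than to a reasoning deficit.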

Problem

Research questions and friction points this paper is trying to address.

Evaluating how search agents handle ambiguous user queries

Assessing interactive disambiguation capabilities during search

Identifying systematic overconfidence in query resolution models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a benchmark for evaluating whether search agents interact to resolve ambiguity

Uses target-distractor methodology to create ambiguous queries

Forces interaction to resolve ambiguity and improve accuracy
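
The target-distractor idea can be illustrated with a minimal sketch. This is not the authors' construction pipeline: the `Entity` class, the helper functions, and the toy entities below are all hypothetical, and the sketch assumes an ambiguous query is formed from attributes that the target shares with a distractor, so that only a clarifying answer can separate them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record: a name plus a set of descriptive attributes."""
    name: str
    attributes: frozenset[str]


def build_ambiguous_query(target: Entity, distractor: Entity) -> frozenset[str]:
    """Keep only attributes both entities share, so the query alone fits either one."""
    return target.attributes & distractor.attributes


def distinguishing_attributes(target: Entity, distractor: Entity) -> frozenset[str]:
    """Attributes held by the target only; a clarifying question must elicit one of these."""
    return target.attributes - distractor.attributes


def candidates(known: frozenset[str], pool: list[Entity]) -> list[str]:
    """Names of every entity consistent with all attributes known so far."""
    return [e.name for e in pool if known <= e.attributes]


# Toy data, invented purely for illustration.
target = Entity("Entity A", frozenset({"founded in the 1990s", "based in Europe", "privately held"}))
distractor = Entity("Entity B", frozenset({"founded in the 1990s", "based in Europe", "publicly listed"}))
pool = [target, distractor]

query = build_ambiguous_query(target, distractor)
before = candidates(query, pool)                              # both entities still match
clarified = query | distinguishing_attributes(target, distractor)
after = candidates(clarified, pool)                           # only the target remains
```

In this toy setup, an agent that answers from `query` alone cannot distinguish the target from the distractor, while one clarifying answer narrows the candidates to the target. The answer is still easy to verify once the distinguishing attribute is known.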
