DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

📅 2026-01-30
🤖 AI Summary
Existing benchmarks for tip-of-the-tongue search are restricted to single-turn settings, so they fail to capture how people typically recall information in conversation: incrementally, over multiple turns. To bridge this gap, the authors propose DETOUR, a multi-turn dual-agent evaluation benchmark comprising 1,011 underspecified prompts. A Primary Agent, the model under evaluation, must identify the recollected entity by querying a Memory Agent that is held fixed across evaluations. The benchmark spans text, image, audio, and video modalities. State-of-the-art models reach only 36% accuracy when evaluated on all modalities, which points to substantial limitations in underspecified retrieval scenarios.

📝 Abstract
When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.
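
The dual-agent protocol described above can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Episode`, `run_episode`, `accuracy`, the toy agents, the turn budget, and exact-match scoring) is a hypothetical stand-in chosen for illustration, and the sketch is text-only, whereas DETOUR also covers image, audio, and video.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical; none are taken from the DETOUR paper.
Turn = Tuple[str, str]    # (Primary Agent question, Memory Agent reply)
Action = Tuple[str, str]  # ("ask", question) or ("answer", final guess)


@dataclass
class Episode:
    prompt: str  # underspecified starting description
    target: str  # entity the Memory Agent is trying to recall


def run_episode(
    episode: Episode,
    primary: Callable[[str, List[Turn]], Action],
    memory: Callable[[str, str], str],
    max_turns: int = 5,
) -> bool:
    """Let the Primary Agent query the Memory Agent until it answers."""
    history: List[Turn] = []
    for _ in range(max_turns):
        kind, text = primary(episode.prompt, history)
        if kind == "answer":
            # Assumed scoring rule: case-insensitive exact match.
            return text.strip().lower() == episode.target.strip().lower()
        history.append((text, memory(episode.target, text)))
    return False  # turn budget exhausted without a final answer


def accuracy(episodes, primary, memory, max_turns: int = 5) -> float:
    """Fraction of episodes in which the Primary Agent names the target."""
    hits = sum(run_episode(e, primary, memory, max_turns) for e in episodes)
    return hits / len(episodes)


# Toy agents, for demonstration only.
CANDIDATES = ["Inception", "Memento", "Tenet"]


def toy_memory(target: str, question: str) -> str:
    # The same Memory Agent is reused for every Primary Agent under test.
    return "yes" if target.lower() in question.lower() else "no"


def toy_primary(prompt: str, history: List[Turn]) -> Action:
    if history and history[-1][1] == "yes":
        # The most recently asked candidate was confirmed.
        return ("answer", CANDIDATES[len(history) - 1])
    if len(history) < len(CANDIDATES):
        return ("ask", f"Is it {CANDIDATES[len(history)]}?")
    return ("answer", "unknown")


EPISODES = [
    Episode("a film told backwards about memory loss", "Memento"),
    Episode("that heist film set inside dreams", "Inception"),
    Episode("a war film, something about a beach", "Dunkirk"),
]
```

Calling `accuracy(EPISODES, toy_primary, toy_memory)` scores the toy Primary Agent against the fixed toy Memory Agent. Swapping in a different `primary` while leaving `memory` untouched mirrors the idea of holding the Memory Agent consistent across evaluations.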

Problem

Research questions and friction points this paper is trying to address.

dual-agent
tip-of-the-tongue
underspecified retrieval
interactive benchmark
multi-turn search

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-agent
interactive benchmark
tip-of-the-tongue search
underspecified retrieval
multi-turn reasoning