🤖 AI Summary
Existing evaluation benchmarks for "tip-of-the-tongue" search are restricted to single-turn settings, so they fail to capture how people actually recall information in conversation: incrementally, over multiple turns. To bridge this gap, the authors propose DETOUR, a multimodal, multi-turn, dual-agent evaluation benchmark comprising 1,011 underspecified prompts. In this framework, a Primary Agent (the model under evaluation) must identify the recollected entity by querying a Memory Agent, which is held consistent across evaluations. DETOUR covers text, image, audio, and video modalities. Experimental results show that state-of-the-art models achieve only 36% accuracy when evaluated on all modalities, highlighting their substantial limitations in underspecified retrieval scenarios.
📝 Abstract
When recalling information in conversation, people often arrive at the recollection only after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To simulate tip-of-the-tongue search more realistically, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation and is tasked with identifying the recollected entity by querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, achieving only 36% accuracy when evaluated on all modalities (text, image, audio, and video), which highlights the importance of enhancing capabilities in underspecified scenarios.
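The dual-agent setup described in the abstract can be pictured as a simple loop: the Primary Agent asks questions, the Memory Agent answers them, and the episode ends when the Primary Agent commits to a guess. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `Prompt`, `run_episode`, `accuracy`, the `GUESS:` convention, the turn budget, and exact-match scoring are all hypothetical, and the toy agents stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical; the source does not specify an API.
History = List[Tuple[str, str]]  # (question, answer) pairs so far


@dataclass
class Prompt:
    hint: str    # underspecified opening description
    target: str  # entity the Primary Agent must identify


# Primary Agent: (hint, history) -> next question, or "GUESS: <entity>"
PrimaryAgent = Callable[[str, History], str]
# Memory Agent: (target, question) -> answer; the same agent is reused
# for every Primary Agent so that scores stay comparable.
MemoryAgent = Callable[[str, str], str]


def run_episode(prompt: Prompt, primary: PrimaryAgent,
                memory: MemoryAgent, max_turns: int = 5) -> bool:
    """Run one multi-turn episode; True if the final guess is correct."""
    history: History = []
    for _ in range(max_turns):
        message = primary(prompt.hint, history)
        if message.startswith("GUESS:"):
            guess = message[len("GUESS:"):].strip()
            # Assumption: exact, case-insensitive match counts as correct.
            return guess.lower() == prompt.target.lower()
        history.append((message, memory(prompt.target, message)))
    return False  # turn budget exhausted without a guess


def accuracy(prompts: List[Prompt], primary: PrimaryAgent,
             memory: MemoryAgent, max_turns: int = 5) -> float:
    """Fraction of prompts for which the Primary Agent guessed correctly."""
    if not prompts:
        return 0.0
    correct = sum(run_episode(p, primary, memory, max_turns) for p in prompts)
    return correct / len(prompts)


# Toy stand-ins for real models, for illustration only.
def toy_memory(target: str, question: str) -> str:
    return f"It starts with '{target[0]}'."


def toy_primary(hint: str, history: History) -> str:
    if not history:
        return "What letter does it start with?"
    return "GUESS: Casablanca"


prompts = [
    Prompt("an old black-and-white film set in Morocco", "Casablanca"),
    Prompt("a song about a yellow vehicle", "Yellow Submarine"),
]
print(accuracy(prompts, toy_primary, toy_memory))  # 0.5
```

Keeping the Memory Agent fixed while swapping the Primary Agent is what makes accuracy figures comparable across evaluated models in this sketch.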