Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
General-purpose AI assistants exhibit significant limitations on "tip-of-the-tongue" (TOT) tasks, i.e., retrieving and reasoning about known-but-inaccessible items, where humans achieve near-perfect accuracy (98%) while the best current systems reach only about 56%. Method: The paper introduces BLUR (Browsing Lost Unformed Recollections), a multimodal, multilingual benchmark explicitly designed for TOT known-item search. It comprises 573 validated questions derived from real-world user behavior, requiring joint capabilities in tool use, cross-modal understanding, and logical reasoning, spanning multimodal retrieval, cross-lingual alignment, and tool orchestration. Contribution/Results: The authors release 350 questions through a public leaderboard, retain the answers to 250 of them, and hold out the remaining questions as a private test set. State-of-the-art systems achieve only about 56% accuracy, underscoring the benchmark's rigor and its value in exposing fundamental gaps in AI knowledge access and reasoning.

📝 Abstract
We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world, validated questions that demand searching and reasoning across multi-modal and multilingual inputs, as well as proficient tool use, to answer correctly. Humans easily ace these questions (scoring 98% on average), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and hold out the rest as a private test set.
Problem

Research questions and friction points this paper is trying to address.

Develop a benchmark for tip-of-the-tongue known-item search tasks
Evaluate AI assistants on multi-modal, multilingual reasoning
Assess tool-use proficiency in known-item retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal, multilingual search benchmark
Dataset of real-world, validated questions
Public leaderboard for performance tracking