🤖 AI Summary
General-purpose AI assistants exhibit significant limitations on “tip-of-the-tongue” (TOT) tasks, i.e., retrieving and reasoning about items a user knows but cannot directly name: humans achieve near-perfect accuracy (98% on average), while the best-performing current system scores only around 56%.
Method: We introduce BLUR (Browsing Lost Unformed Recollections), a multimodal, multilingual benchmark explicitly designed for TOT scenarios. It comprises 573 real-world validated questions derived from real user behavior, requiring joint capabilities in tool invocation, cross-modal understanding, and logical reasoning. The benchmark quantifies the TOT capability gap across multimodal retrieval, cross-lingual alignment, tool orchestration, and behavior-driven question construction.
Contribution/Results: We release 350 questions through a public leaderboard, withhold the answers to 250 of them, and keep the remaining 223 as a private test set. State-of-the-art systems achieve only around 56% accuracy, underscoring the benchmark’s rigor and its value in exposing fundamental gaps in AI knowledge access and reasoning.
📝 Abstract
We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world validated questions that demand searching and reasoning across multimodal and multilingual inputs, as well as proficient tool use, in order to excel on them. Humans easily ace these questions (scoring 98% on average), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and keep the rest as a private test set.