AI Summary
This study addresses two key challenges in tip-of-the-tongue (TOT) retrieval: users' difficulty in formulating effective queries and the reliance of existing methods on low-quality community question-answering (CQA) data, which limits evaluation validity. To overcome these, we propose two scalable TOT query generation approaches: (1) an LLM-driven TOT user simulator employing cognitive-inspired prompt engineering to generate high-fidelity queries; and (2) a vision-stimulus-guided interface for collecting authentic TOT queries from real users across diverse domains (Movie, Landmark, Person). Our work breaks the dependency on CQA data and establishes the first reproducible, multi-domain TOT benchmark. Experiments show that LLM-generated queries achieve strong rank-order agreement with human queries in the Movie domain (Spearman ρ > 0.9) and high lexical similarity (0.82). The benchmark has been adopted as the official test collection for the TREC 2024/2025 TOT track, and all code and stimulus materials are publicly released.
Abstract
Tip-of-the-tongue (TOT) search occurs when a user struggles to recall a specific identifier, such as a document title. Although such situations are common, existing search systems often fail to support TOT scenarios effectively. Research on TOT retrieval is further constrained by the difficulty of collecting queries: current approaches rely heavily on community question-answering (CQA) websites, leading to labor-intensive evaluation and domain bias. To overcome these limitations, we introduce two methods for eliciting TOT queries, one leveraging large language models (LLMs) and one relying on human participants, to facilitate simulated evaluations of TOT retrieval systems. Our LLM-based TOT user simulator generates synthetic TOT queries at scale; when tested in the Movie domain, it achieves high correlations with how CQA-based TOT queries rank TOT retrieval systems. These synthetic queries also exhibit high linguistic similarity to CQA-derived queries. For human-elicited queries, we developed an interface that uses visual stimuli to place participants in a TOT state, enabling the collection of natural queries. In the Movie domain, system rank correlation and linguistic similarity analyses confirm that human-elicited queries are both effective and closely resemble CQA-based queries. These approaches reduce reliance on CQA-based data collection while expanding coverage to underrepresented domains, such as Landmark and Person. LLM-elicited queries for the Movie, Landmark, and Person domains have been released as test queries in the TREC 2024 TOT track, with human-elicited queries scheduled for inclusion in the TREC 2025 TOT track. We also release source code for synthetic query generation and the human query-collection interface, along with the curated visual stimuli used to elicit TOT queries.
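The rank-agreement evaluation mentioned above compares how two query sets order the same retrieval systems by effectiveness, summarized with Spearman's ρ. The sketch below is a minimal, self-contained illustration of that computation; the system scores and function names are hypothetical and not taken from the paper, and a library such as `scipy.stats.spearmanr` would normally be used instead.

```python
# Minimal sketch of rank-order agreement between two system orderings.
# All scores below are made-up nDCG-style values for five hypothetical systems.

def rankdata(values):
    """Assign ranks (1 = highest score); ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical effectiveness of 5 systems under CQA vs. elicited queries
cqa_scores = [0.61, 0.48, 0.55, 0.30, 0.42]
elicited_scores = [0.58, 0.57, 0.45, 0.28, 0.40]
print(round(spearman_rho(cqa_scores, elicited_scores), 3))  # -> 0.9
```

A ρ near 1 indicates that the elicited queries rank systems almost identically to CQA-based queries, which is the property the paper reports (ρ > 0.9 in the Movie domain).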