🤖 AI Summary
This work addresses the common yet imprecise attribution of factual errors in large language models (LLMs) to missing knowledge, proposing a behavioral analysis framework that disentangles whether facts are unencoded (“empty shelves”) or encoded but inaccessible (“lost keys”). The authors introduce WikiProfile, a new benchmark built with an automated construction pipeline that enables evaluation at the level of individual facts, and present the first systematic categorization of factual errors into encoding gaps versus recall failures. Analyzing over 4 million responses from 13 models, they find that frontier systems such as GPT-5 and Gemini-3 already encode 95–98% of facts, so recall failure, not missing knowledge, is the primary bottleneck; these failures disproportionately affect long-tail facts and reverse queries. Notably, adding inference-time reasoning (“thinking”) substantially improves recall and recovers many of these failures. A multi-granularity accessibility classification scheme underpins these findings.
📝 Abstract
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
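To make the fact-level profiling concrete, here is a minimal sketch of the multi-granularity classification as a cascade of progressively stronger probes. The `query_model` callable, the hint-based encoding probe, and the substring grading are illustrative assumptions, not the paper's actual pipeline (which uses an automated construction and grading setup):

```python
from enum import Enum
from typing import Callable

class FactStatus(Enum):
    DIRECT_RECALL = "directly recallable"          # surfaces from a plain question
    THINKING_RECALL = "recallable with thinking"   # needs inference-time computation
    RECALL_FAIL = "encoded but not recallable"     # lost key: only surfaces with strong cues
    NOT_ENCODED = "not encoded"                    # empty shelf: fails even with strong cues

def is_correct(answer: str, gold: str) -> bool:
    """Toy grading via substring match; a real harness would normalize
    answers or use an LLM judge."""
    return gold.lower() in answer.lower()

def profile_fact(
    question: str,
    gold: str,
    hint: str,
    query_model: Callable[..., str],  # hypothetical stand-in for the model API
) -> FactStatus:
    """Classify one fact by trying the cheapest probe first."""
    # 1. Direct recall: plain question, no extra compute.
    if is_correct(query_model(question), gold):
        return FactStatus.DIRECT_RECALL
    # 2. Recall with inference-time computation ("thinking").
    if is_correct(query_model(question, thinking=True), gold):
        return FactStatus.THINKING_RECALL
    # 3. Encoding probe (assumed form): strong cues such as a hint or
    #    multiple-choice variant. Success implies the fact is encoded
    #    but inaccessible from the plain question.
    if is_correct(query_model(f"{question}\nHint: {hint}", thinking=True), gold):
        return FactStatus.RECALL_FAIL
    return FactStatus.NOT_ENCODED
```

Ordering the probes from cheapest to strongest is what yields the graded labels: a fact's status is determined by the weakest intervention under which the model answers correctly.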