🤖 AI Summary
This work addresses the common yet imprecise attribution of factual errors in large language models (LLMs) to missing knowledge, proposing a behavioral analysis framework that disentangles whether facts are unencoded (“empty shelves”) or encoded but inaccessible (“lost keys”). The authors introduce WikiProfile, a new benchmark built with an automated construction pipeline that enables evaluation at the level of individual facts, and present the first systematic categorization of factual errors into encoding gaps versus recall failures. Analyzing over 4 million responses from 13 models, they find that frontier systems such as GPT-5 and Gemini-3 already encode 95–98% of facts, so recall failure, not missing knowledge, is the primary bottleneck; these failures disproportionately affect long-tail facts and reverse queries. Notably, adding inference-time reasoning (“thinking”) substantially improves recall and recovers many of these failures. A multi-granularity accessibility classification scheme underpins these findings.
📝 Abstract
Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
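To make the fact-level profiling concrete, here is a minimal sketch of the multi-granularity classification as a cascade of progressively stronger probes. The `query_model` callable, the hint-based encoding probe, and the substring grading are illustrative assumptions, not the paper's actual pipeline (which uses an automated construction and grading setup):

```python
from enum import Enum
from typing import Callable

class FactStatus(Enum):
    DIRECT_RECALL = "directly recallable"          # surfaces from a plain question
    THINKING_RECALL = "recallable with thinking"   # needs inference-time computation
    RECALL_FAIL = "encoded but not recallable"     # lost key: only surfaces with strong cues
    NOT_ENCODED = "not encoded"                    # empty shelf: fails even with strong cues

def is_correct(answer: str, gold: str) -> bool:
    """Toy grading via substring match; a real harness would normalize
    answers or use an LLM judge."""
    return gold.lower() in answer.lower()

def profile_fact(
    question: str,
    gold: str,
    hint: str,
    query_model: Callable[..., str],  # hypothetical stand-in for the model API
) -> FactStatus:
    """Classify one fact by trying the cheapest probe first."""
    # 1. Direct recall: plain question, no extra compute.
    if is_correct(query_model(question), gold):
        return FactStatus.DIRECT_RECALL
    # 2. Recall with inference-time computation ("thinking").
    if is_correct(query_model(question, thinking=True), gold):
        return FactStatus.THINKING_RECALL
    # 3. Encoding probe (assumed form): strong cues such as a hint or
    #    multiple-choice variant. Success implies the fact is encoded
    #    but inaccessible from the plain question.
    if is_correct(query_model(f"{question}\nHint: {hint}", thinking=True), gold):
        return FactStatus.RECALL_FAIL
    return FactStatus.NOT_ENCODED
```

Ordering the probes from cheapest to strongest is what yields the graded labels: a fact's status is determined by the weakest intervention under which the model answers correctly.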