🤖 AI Summary
Contemporary large language models (LLMs), such as GPT-4.1, exhibit critical deficiencies in factual knowledge, namely hallucination, internal inconsistency, and semantic ambiguity. Yet existing evaluations rely on small-scale, biased benchmarks that inadequately reflect real-world knowledge reliability.
Method: We systematically extract and analyze 100 million fact-belief statements generated by the model. To ensure representativeness and reduce sampling bias, we introduce Recursive Prompting to construct GPTKB v1.5, a large-scale, unbiased factual dataset. We then apply statistical modeling and cross-source validation against authoritative knowledge bases (e.g., Wikidata, DBpedia).
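The recursive elicitation behind GPTKB can be pictured as a breadth-first crawl over the model's beliefs: prompt the LLM for facts about an entity, treat every newly mentioned entity as a new subject to expand, and repeat until the frontier is exhausted or a budget is hit. The sketch below illustrates only this control flow; `query_llm` and its toy triples are placeholder assumptions, not the actual GPTKB v1.5 pipeline or prompts.

```python
from collections import deque

def query_llm(entity):
    """Stub standing in for a real LLM call (e.g., to GPT-4.1).
    Returns (subject, predicate, object) triples the model asserts
    about the entity. The toy data here is purely illustrative."""
    toy_beliefs = {
        "Douglas Adams": [
            ("Douglas Adams", "author of",
             "The Hitchhiker's Guide to the Galaxy"),
        ],
        "The Hitchhiker's Guide to the Galaxy": [
            ("The Hitchhiker's Guide to the Galaxy",
             "genre", "science fiction"),
        ],
    }
    return toy_beliefs.get(entity, [])

def recursive_elicitation(seed, max_entities=1000):
    """Breadth-first crawl of model beliefs: elicit triples about an
    entity, enqueue each newly seen object as a future subject, and
    continue until the frontier empties or the entity budget is spent."""
    seen = {seed}
    frontier = deque([seed])
    beliefs = []
    while frontier and len(seen) <= max_entities:
        entity = frontier.popleft()
        for s, p, o in query_llm(entity):
            beliefs.append((s, p, o))
            if o not in seen:  # objects become new subjects to expand
                seen.add(o)
                frontier.append(o)
    return beliefs
```

Starting from the seed "Douglas Adams", the crawl above discovers the book as a new entity, expands it in turn, and stops once "science fiction" yields no further triples. The real pipeline additionally handles entity canonicalization and prompt design, which this sketch omits.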
Contribution/Results: Our analysis is the first to quantify, at scale, pervasive hallucinations and inconsistencies in LLM knowledge, revealing significantly lower factual accuracy than reported on standard benchmarks, substantial distributional divergence from structured knowledge repositories, and severe methodological biases in current evaluation protocols. This work establishes a reproducible, large-scale empirical framework and a new benchmark paradigm for rigorous LLM knowledge assessment.
📝 Abstract
LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the model's factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity, and hallucination are major issues, shedding light on future research opportunities concerning factual LLM knowledge.