🤖 AI Summary
Hallucination in large language models (LLMs) stems from factual discrepancies between the model's internal world model and observable reality. Method: This work unifies diverse hallucination phenomena under the concept of "world-modeling inaccuracy," distinguishing hallucination from planning or reward-misalignment errors; it introduces a definition that makes explicit the reference source of truth (e.g., a knowledge base or in-context material) and the policy for resolving conflicts between sources; and it proposes a hallucination evaluation framework grounded in fully specified synthetic worlds, spanning conceptual analysis, formal modeling, and benchmark design. Contribution/Results: The framework offers a theoretically grounded, scalable methodology for assessing and improving LLMs' factual consistency, enabling precise diagnosis of where hallucinations originate and supporting principled mitigation. It moves hallucination evaluation beyond heuristic or human-annotated benchmarks toward rigorous, controllable settings.
📝 Abstract
Despite numerous attempts to address hallucination since the inception of neural language models, it remains a problem even in today's frontier large language models. Why is this the case? We survey definitions of hallucination used in the literature, from early work to the present, and fold them into a single definition in which each prior definition emerges as an emphasis on a different aspect. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form observable to the user (e.g., stating a fact that contradicts a knowledge base, or producing a summary that contradicts a known source). By varying the reference world model and the knowledge-conflict policy (e.g., knowledge base vs. in-context), we recover the different definitions of hallucination present in the literature.
We argue that this unified view is useful because it forces evaluations to make explicit their assumed "world," or source of truth; clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors); and provides a common language for comparing benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models across different environments, and we sketch how such settings can be used to stress-test and improve the world-modeling components of language models.
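To make the definition concrete, here is a minimal illustrative sketch (all names and the toy world are my own, not from the paper): a fully specified synthetic world is a table of facts, the reference world model is assembled from a knowledge base plus in-context statements under an explicit conflict-resolution policy, and a hallucination is a model claim that contradicts that reference.

```python
def resolve_reference(knowledge_base, context, policy="context_wins"):
    """Assemble the reference world model from two fact sources.

    policy="context_wins": in-context facts override the knowledge base
    (roughly the faithfulness/summarization notion of hallucination).
    policy="kb_wins": the knowledge base overrides the context
    (roughly the factuality notion of hallucination).
    """
    if policy == "context_wins":
        return {**knowledge_base, **context}
    if policy == "kb_wins":
        return {**context, **knowledge_base}
    raise ValueError(f"unknown policy: {policy}")

def is_hallucination(claim, reference):
    """A (subject, value) claim hallucinates iff it contradicts the
    reference; claims about subjects absent from the reference are
    simply not checkable in this toy setting."""
    subject, value = claim
    return subject in reference and reference[subject] != value

# Fully specified synthetic world: capital cities.
kb = {"France": "Paris", "Australia": "Canberra"}
ctx = {"Australia": "Sydney"}  # a deliberately wrong in-context statement

claim = ("Australia", "Sydney")
# Faithful to the context, so not a hallucination under context_wins:
print(is_hallucination(claim, resolve_reference(kb, ctx, "context_wins")))  # False
# Contradicts the knowledge base, so a hallucination under kb_wins:
print(is_hallucination(claim, resolve_reference(kb, ctx, "kb_wins")))  # True
```

The same claim flips between hallucination and non-hallucination purely as a function of the chosen reference and conflict policy, which is the point of the unified definition: a benchmark must state both before a mismatch can be labeled.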