Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing benchmarks lack systematic evaluation of large language models' (LLMs) ability to internalize real-world empirical distributions, a foundational requirement for causal reasoning, particularly at Pearl's first level of the causal hierarchy (observational distribution knowledge).

Method: We introduce the first dedicated benchmark for distributional knowledge assessment, comprising a high-quality, multi-domain test set (economics, health, education, social behavior) derived from large-scale real-world statistical data. It features question-answering and distribution-prediction tasks designed to probe models' grasp of population-level statistical regularities.

Contribution/Results: Experiments reveal that state-of-the-art LLMs exhibit significant deficits in knowledge of observational distributions, indicating weak grounding for causal inference. This work fills a critical gap in probabilistic knowledge evaluation and establishes a scalable, extensible framework for assessing distributional awareness, providing both a diagnostic tool and a foundational benchmark to advance LLMs' real-world modeling capabilities.
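A minimal sketch of how such a distribution-prediction task might be scored, assuming the model is prompted to output explicit category probabilities: the prompt, the numbers, and the total-variation metric below are illustrative assumptions, since the summary does not state the paper's exact protocol or metric.

```python
# Hypothetical scoring sketch for a distribution-prediction task:
# compare an LLM's predicted category probabilities against an
# empirical ground-truth distribution via total variation distance.
# All numbers and the example prompt are illustrative, not from the paper.

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two distributions on the same support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

# Ground truth, e.g., taken from official graduation statistics (illustrative).
empirical = {"male": 0.78, "female": 0.22}

# Parsed from an LLM's answer to a prompt such as "Give the sex distribution
# of US computer science graduates as probabilities summing to 1."
predicted = {"male": 0.55, "female": 0.45}

print(f"TV distance: {total_variation(predicted, empirical):.3f}")  # 0.230
```

A TV distance of 0.0 would indicate perfect distributional knowledge; per-question distances can then be averaged within each domain to yield a benchmark score.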

📝 Abstract
Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as the curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions, casting doubt on the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge of observational distributions (Layer 1 of the PCH), and thus the Causal Hierarchy Theorem implies that the interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.
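To make the curse-of-dimensionality point concrete: a joint distribution over d categorical variables with k levels each has k^d cells, so the data needed to estimate the full joint table directly explodes with d. A back-of-the-envelope sketch (the variable and level counts are arbitrary, chosen only for illustration):

```python
# Back-of-the-envelope illustration of the curse of dimensionality:
# a joint table over d categorical variables with k levels each has
# k**d cells, each of which needs samples to estimate its frequency.

k = 10  # levels per variable (e.g., income deciles) -- illustrative
for d in (2, 5, 10, 20):
    print(f"{d:>2} variables -> {k**d:.1e} cells in the joint table")

# Output:
#  2 variables -> 1.0e+02 cells in the joint table
#  5 variables -> 1.0e+05 cells in the joint table
# 10 variables -> 1.0e+10 cells in the joint table
# 20 variables -> 1.0e+20 cells in the joint table
```

At 20 variables the table already has far more cells than any realistic survey has respondents, which is the intuition behind the abstract's skepticism about universal distributional learning.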
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs' knowledge of real-world probability distributions
Evaluating observational distribution learning across multiple domains
Testing limitations in learning high-dimensional statistical distributions
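The high-dimensional limitation in the last point has a classical quantitative form in nonparametric statistics. As a reference point (supplied here for context, not quoted from the paper), the minimax squared L2 risk for estimating a β-smooth density on R^d from n i.i.d. samples scales as:

```latex
% Classical nonparametric minimax rate (context, not from the paper):
% estimating a \beta-smooth density on R^d from n i.i.d. samples.
\inf_{\hat{p}_n} \sup_{p \in \mathcal{P}(\beta)}
  \mathbb{E}\,\lVert \hat{p}_n - p \rVert_2^2
  \;\asymp\; n^{-\frac{2\beta}{2\beta + d}}
```

For fixed smoothness β and a fixed target error, the required sample size therefore grows exponentially in the dimension d, which is the formal sense in which learning high-dimensional distributions is hard.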
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed the first dedicated benchmark for testing LLMs' knowledge of real-world distributions
Built question-answering and distribution-prediction tasks over economics, health, education, and social-behavior data
Interpreted results through Pearl's Causal Hierarchy, tracing Layer-1 deficits to Layers 2 and 3
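For reference, the three layers of Pearl's Causal Hierarchy invoked in the last point are standardly written as follows (standard textbook notation, not necessarily the paper's):

```latex
% Pearl's Causal Hierarchy, standard notation (context, not from the paper):
\text{Layer 1 (observational, ``seeing''):}     \quad P(y \mid x) \\
\text{Layer 2 (interventional, ``doing''):}     \quad P(y \mid \mathrm{do}(x)) \\
\text{Layer 3 (counterfactual, ``imagining''):} \quad P(y_x \mid x', y')
```

The Causal Hierarchy Theorem states that, absent further assumptions, quantities at a higher layer are underdetermined by those below it; as the abstract argues, a model lacking Layer-1 knowledge therefore cannot be expected to answer Layer-2 or Layer-3 queries reliably.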