Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models

📅 2024-05-15
🏛️ arXiv.org
📈 Citations: 28
Influential: 4
🤖 AI Summary
This study investigates large language models' (LLMs') grasp of foundational world knowledge—as distinct from surface statistical co-occurrence—to assess their capacity for world modeling in contextual language understanding. To this end, we introduce EWOK, a cognitive-science-informed framework that formalizes 11 domains of human core world knowledge (e.g., social interactions, spatial relations) and defines minimal-pair context–target matching tasks. We construct EWOK-CORE-1.0, a concept-driven, template-based, scalable benchmark of 4,374 items for world knowledge evaluation, accompanied by a human norming study comprising 12,480 measurements. Across a battery of evaluation paradigms, 20 open-weights LLMs (1.3B–70B parameters) consistently underperform humans, with pronounced cross-domain disparities. Our results systematically expose deficiencies in current LLMs' grasp of basic conceptual world knowledge.

📝 Abstract
The ability to build and leverage world models is essential for a general-purpose AI agent. Testing such capabilities is hard, in part because the building blocks of world models are ill-defined. We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in language models by testing their ability to use knowledge of a concept to match a target text with a plausible/implausible context. EWOK targets specific concepts from multiple knowledge domains known to be vital for world modeling in humans. Domains range from social interactions (help/hinder) to spatial relations (left/right). Both contexts and targets are minimal pairs. Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWOK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B–70B parameters) across a battery of evaluation paradigms, along with a human norming study comprising 12,480 measurements. The overall performance of all tested models is worse than human performance, with results varying drastically across domains. These data highlight simple cases where even large models fail and present rich avenues for targeted research on LLM world modeling capabilities.
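The abstract's evaluation setup—score a target text against a plausible and an implausible minimal-pair context, and count an item correct when the plausible context wins—can be sketched as follows. The word-overlap `score` below is a deliberately crude stand-in chosen for illustration (it is exactly the kind of surface co-occurrence heuristic the benchmark is designed to defeat); the paper's actual evaluations use the language model's own likelihoods, and the item shown is invented, not drawn from EWOK-CORE-1.0.

```python
import string

def tokenize(text: str) -> set:
    """Lowercase, strip punctuation, and split into a set of words."""
    clean = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(clean.split())

def score(context: str, target: str) -> float:
    """Toy plausibility score: fraction of target words appearing in
    the context. A real EWOK evaluation would instead use an LLM's
    log-likelihood of the target given the context."""
    ctx, tgt = tokenize(context), tokenize(target)
    return len(ctx & tgt) / max(len(tgt), 1)

def evaluate(items: list) -> float:
    """Accuracy: fraction of items where the plausible context scores
    higher than the implausible one for the same target."""
    correct = sum(
        score(it["context_plausible"], it["target"])
        > score(it["context_implausible"], it["target"])
        for it in items
    )
    return correct / len(items)

# One illustrative minimal-pair item (invented, not from EWOK-CORE-1.0).
items = [
    {
        "context_plausible": "She wanted to help him carry the boxes upstairs.",
        "context_implausible": "She refused and walked away quickly.",
        "target": "She helped him carry the boxes.",
    },
]
print(evaluate(items))  # → 1.0
```

Swapping the toy `score` for a model's conditional log-probability of the target turns this into the context-plausibility comparison the abstract describes; the minimal-pair design ensures the two contexts differ only in the concept under test.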
Problem

Research questions and friction points this paper is trying to address.

Evaluating language models' conceptual world knowledge understanding
Disentangling conceptual knowledge from surface co-occurrence statistics
Assessing model performance across diverse world knowledge domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognition-inspired framework for world knowledge evaluation
Flexible dataset generation with controlled variables
Comparative analysis of LLMs versus human performance