Potemkin Understanding in Large Language Models

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper challenges the validity of standard LLM benchmarking: a model may produce correct answers while holding concept representations that no human interpretation of the concept could produce, a phenomenon termed "potemkin understanding" (superficial correctness masking internal incoherence). The authors give a formal framework for when benchmark success licenses claims about understanding, and propose two procedures for quantifying potemkins: (1) a purpose-built benchmark in three domains that checks whether a model able to state a concept correctly can also apply it, and (2) a general, automated procedure that yields a lower bound on potemkin prevalence. Experiments across mainstream models, tasks, and domains show that potemkins are ubiquitous, and that these failures reflect not just incorrect understanding but deeper internal incoherence in concept representations. The work offers an interpretable, scalable evaluation paradigm that goes beyond surface-level accuracy.
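As a concrete illustration of the lower-bound idea, the sketch below flags a potemkin whenever a model answers a "keystone" question about a concept correctly but then fails a task that requires actually using the concept. The `ConceptProbe` structure, the `ask` and `grade` callables, and the prompts are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch only: data structures, prompts, and the ask/grade
# callables are assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConceptProbe:
    concept: str            # e.g. a literary technique or game-theoretic notion
    keystone_prompt: str    # question whose correct answer signals apparent understanding
    use_prompts: List[str]  # tasks that require applying the concept


def potemkin_rate_lower_bound(
    probes: List[ConceptProbe],
    ask: Callable[[str], str],          # sends one prompt to the LLM under test
    grade: Callable[[str, str], bool],  # True if the response is correct for the prompt
) -> float:
    """Among concepts whose keystone question the model answers correctly,
    return the fraction where it still fails at least one use task.
    Any such failure is by definition a potemkin, so this is a lower bound."""
    keystone_passed = 0
    potemkins = 0
    for probe in probes:
        if not grade(probe.keystone_prompt, ask(probe.keystone_prompt)):
            continue  # no apparent understanding to begin with; not counted
        keystone_passed += 1
        if any(not grade(p, ask(p)) for p in probe.use_prompts):
            potemkins += 1
    return potemkins / keystone_passed if keystone_passed else 0.0
```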

📝 Abstract
Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
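One way to make the abstract's notion of internal incoherence concrete is a self-consistency check: ask the model to generate an instance of a concept, then independently ask the same model whether that instance fits the concept, and count disagreements. The sketch below is a hypothetical illustration under that reading; the query function and prompt wording are assumptions, not the paper's exact protocol.

```python
# Hypothetical self-consistency check for internal incoherence:
# a coherent model should accept its own generated examples of a concept.
# The ask() query function and prompts are illustrative placeholders.

from typing import Callable, List


def incoherence_rate(
    concepts: List[str],
    ask: Callable[[str], str],  # one-shot query to the LLM under test
) -> float:
    """Fraction of concepts where the model rejects its own generated example."""
    disagreements = 0
    for concept in concepts:
        example = ask(f"Give one example of {concept}. Reply with the example only.")
        verdict = ask(
            f"Is the following an example of {concept}? Answer yes or no.\n\n{example}"
        )
        if not verdict.strip().lower().startswith("yes"):
            disagreements += 1
    return disagreements / len(concepts) if concepts else 0.0
```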
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' true understanding via benchmark datasets
Identifying illusion of understanding in LLM responses
Measuring internal incoherence in LLM concept representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formal framework for evaluating LLM understanding
Special benchmark to quantify potemkin understanding
General procedure to estimate potemkin prevalence
Authors
Marina Mancoridis, Massachusetts Institute of Technology
Bec Weeks, University of Chicago
Keyon Vafa, Harvard University (Machine learning)
Sendhil Mullainathan, MIT (Algorithms and people; Economics; Machine learning; Behavioral Science)