🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine conceptual understanding of physics or merely exhibit superficial, pattern-based imitation (“stochastic parrots”).
Method: We introduce PhysiCo, a novel benchmark that employs abstract grid-based inputs to eliminate memorization bias, and we propose “gridified abstract representation” as a paradigm that decouples linguistic surface form from underlying conceptual comprehension. PhysiCo systematically evaluates physical reasoning via multi-level concept encoding—phenomenon, application, and analogy—under zero-shot, few-shot, and fine-tuning settings.
Contribution/Results: State-of-the-art models such as GPT-4o underperform human baselines by ~40% on PhysiCo. While excelling at natural language description, they fail markedly on grid-based reasoning, revealing a fundamental deficit in deep conceptual understanding. Format-adaptive fine-tuning yields only marginal gains. PhysiCo establishes a rigorous, representation-decoupled benchmark and methodological framework for assessing conceptual understanding in LLMs.
📝 Abstract
In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the use of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task yet can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulty rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.
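To make the idea of grid-format inputs concrete, the sketch below shows how such a multiple-choice item might be encoded and rendered as a text prompt. This is a purely illustrative guess: the grid values, the `format_item` helper, and the answer options are assumptions for exposition, not PhysiCo's actual data schema.

```python
# Hypothetical sketch of a PhysiCo-style grid item (illustrative only).
# A physical concept, e.g. gravity, is depicted abstractly: non-zero cells
# trace an object falling column-wise toward the bottom row.

GRAVITY_GRID = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 2, 0],  # "2" marks the object's final resting position
]

def format_item(grid, choices):
    """Render a grid plus multiple-choice options as a text prompt,
    roughly the way such an item might be fed to an LLM."""
    rows = "\n".join(" ".join(str(cell) for cell in row) for row in grid)
    opts = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"Which physical concept does this grid depict?\n{rows}\n{opts}"

prompt = format_item(GRAVITY_GRID, ["gravity", "buoyancy", "refraction"])
print(prompt)
```

Because the concept is carried only by the abstract spatial pattern, a model cannot rely on memorized natural-language descriptions of "gravity" to answer, which is the decoupling the benchmark aims for.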