🤖 AI Summary
This study systematically investigates cognitive disparities between humans and large language models (LLMs), exemplified by successive GPT versions, in algorithmic understanding. Method: The authors propose a formal, quantifiable five-level framework for algorithmic understanding that integrates philosophical, psychological, and pedagogical perspectives. Using cognitive modeling, human-annotated evaluation protocols, and human–AI double-blind experiments, they compare undergraduate and graduate students with multiple GPT generations. Contribution/Results: GPT achieves near-human performance in syntactic parsing and stepwise execution but exhibits significant deficits in higher-order competencies, particularly abstract transfer and causal explanation, manifesting a "superficially correct, deeply deficient" pattern. The authors propose the hierarchical scale as a rigorous criterion for assessing and tracking algorithmic understanding in AI systems.
📝 Abstract
As Large Language Models (LLMs) perform, and sometimes excel at, ever more complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate well-trodden research from philosophy, psychology, and education. We initiate this effort, focusing specifically on the understanding of algorithms, and propose a hierarchy of levels of understanding. We use the hierarchy to design and conduct a study with human subjects (undergraduate and graduate students) as well as large language models (generations of GPT), revealing interesting similarities and differences. We expect that our rigorous criteria will be useful for tracking AI's progress in such cognitive domains.