🤖 AI Summary
Current interpretability research on large language models (LLMs) lacks a unified theoretical framework, hindering systematic understanding of their behavior and internal mechanisms.
Method: We systematically adapt Marr's three levels of analysis from cognitive science (computational theory, representation and algorithm, and physical implementation) to the study of LLMs, establishing an interdisciplinary interpretability framework. Our approach integrates cognitive modeling, behavioral experimentation, neurosymbolic interface analysis, and representation probing, yielding a reusable, cognitive-science-inspired analytical protocol.
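As a concrete illustration of the representation-probing component, the sketch below trains a linear probe on hidden states extracted from a pretrained LLM. This is a minimal example under stated assumptions, not the paper's implementation: the model (`gpt2`), the probed layer, and the toy sentiment task are all illustrative choices.

```python
# Minimal representation-probing sketch. Assumptions (not from the paper):
# model = gpt2, probed layer = 6, task = toy binary sentiment.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM exposing hidden states works
LAYER = 6            # assumption: probe a middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Hypothetical probe task: does some layer linearly encode sentiment?
texts = ["I loved this movie.", "This was a terrible film.",
         "An absolute delight.", "Painfully boring throughout."]
labels = [1, 0, 1, 0]

def embed(text: str) -> torch.Tensor:
    """Return the chosen layer's hidden state for the final token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

X = torch.stack([embed(t) for t in texts]).numpy()

# Fit the linear probe: if a simple classifier can read the property off
# the representation, the model plausibly encodes it at this layer.
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe training accuracy:", probe.score(X, labels))
```

If such a probe recovers the target property well above chance (on held-out data, in a real study), the representation at that layer plausibly encodes it; this is the kind of algorithmic-level evidence the framework draws on.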
Contribution/Results: Validated across multiple mainstream LLMs, the framework enables rigorous mechanistic attribution, root-cause tracing of biases, and fine-grained capability decomposition. It advances LLM understanding beyond empirical engineering heuristics toward principled, scientifically grounded modeling and explanation, bridging cognitive theory and foundation model science.
📝 Abstract
Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr's three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.