🤖 AI Summary
This paper addresses the question of whether AI models possess genuine introspective capability. To resolve ambiguities in existing definitions, the authors propose a stricter “thick introspection” criterion: an agent introspects only if its access to its internal states is more reliable than any process of equal or lower computational cost available to a third-party observer. Using this definition, the study empirically evaluates large language models’ (LLMs) ability to self-report their internal temperature parameter, a concrete introspective task. Results show that while LLMs can appear introspective under weaker definitions, their self-reports are significantly less reliable than external measurements and thus fail the joint cost–reliability constraint. This work clarifies the conceptual boundaries of introspection, offers a quantitative test of self-referential reasoning in LLMs, and argues on empirical grounds that current LLMs lack substantive introspective capacity, providing both a theoretical framework and an empirical foundation for research on AI self-awareness.
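One way to state this criterion formally (a sketch; the symbols below are our gloss, not notation from the paper): write R(q) for the reliability of a process q, C(q) for its computational cost, and P_3rd(s) for the set of processes a third party could use to learn about internal state s.

```latex
% Hedged formalization of the thick-introspection criterion.
% R(q): reliability of process q; C(q): its computational cost;
% P_3rd(s): processes available to a third party for learning state s.
% These symbols are our own gloss, not notation from the paper.
\[
  \mathrm{Introspects}(A, s, p) \;\iff\;
  \forall q \in \mathcal{P}_{\mathrm{3rd}}(s)\,
  \bigl[\, C(q) \le C(p) \;\Rightarrow\; R(p) > R(q) \,\bigr]
\]
```

On this reading, a self-report that merely matches an external estimator, or beats it only at higher cost, does not count as introspection.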
📝 Abstract
Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection should be defined. Beginning from a recently proposed “lightweight” definition, we argue instead for a thicker one. On our proposal, introspection in AI is any process that yields information about an agent’s internal states more reliably than any process of equal or lower computational cost available to a third party. Using experiments in which LLMs reason about their internal temperature parameter, we show that models can appear to introspect in the lightweight sense while failing to introspect meaningfully under our proposed definition.
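To make the comparison concrete, here is a minimal sketch (our illustration, not the authors’ code) of the kind of third-party baseline the criterion invokes: estimating the temperature by maximum likelihood from observed samples, assuming the observer can see the model’s next-token logits, as many LLM APIs expose via logprobs. The model’s verbal self-report would have to beat this estimator’s reliability at equal or lower computational cost.

```python
# Illustrative sketch, not the authors' code: a third-party observer
# recovering an LLM's sampling temperature from its outputs alone.
# Assumptions: the observer sees the base logits for one next-token
# distribution (e.g., via API logprobs) and n sampled tokens.
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max()                      # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()

def sample_tokens(logits, temperature, n):
    """Stand-in for collecting n next-token samples from the model."""
    return rng.choice(len(logits), size=n, p=softmax(logits, temperature))

def third_party_estimate(logits, samples, grid=np.linspace(0.1, 2.0, 96)):
    """Grid-search maximum-likelihood estimate of the temperature."""
    counts = np.bincount(samples, minlength=len(logits))
    log_lik = [counts @ np.log(softmax(logits, t)) for t in grid]
    return grid[int(np.argmax(log_lik))]

true_temp = 0.7
logits = rng.normal(size=50)             # toy next-token logits
samples = sample_tokens(logits, true_temp, n=500)
print(f"third-party MLE of temperature: "
      f"{third_party_estimate(logits, samples):.2f}")

# The introspective condition replaces this estimator with the model's
# own verbal report of its temperature; the thick criterion asks whether
# that report is more reliable at equal or lower computational cost.
```

Grid-search MLE is just one cheap observer strategy; any third-party process of equal or lower cost would serve equally well as the comparison class.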