🤖 AI Summary
This study investigates how large language models (LLMs) achieve introspection, specifically whether they detect "thought injections" through reasoning or via direct access to internal states. Reproducing the thought-injection detection paradigm, we systematically analyze introspective mechanisms in open-source LLMs and, for the first time, clearly disentangle two decoupled pathways: probability-matching-based reasoning and content-agnostic direct access. Our experiments reveal that models often fall back on high-frequency concrete concepts (e.g., "apple") when confabulating guesses, whereas accurately identifying the injected semantic content requires substantially more tokens, confirming both the existence and the limitations of the direct-access mechanism. These findings align with dominant theories of introspection in psychology and philosophy and offer novel empirical support for cognitive architectures in artificial intelligence.
📝 Abstract
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating the thought-injection detection paradigm of Lindsey et al. (2025) in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomalousness of the prompt) and (ii) direct access to internal states. The direct-access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for these models, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.