Dissociating Direct Access from Inference in AI Introspection

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how large language models (LLMs) introspect — specifically, whether they detect "thought injections" through reasoning over the prompt or via direct access to internal states. Reproducing the thought-injection detection paradigm, we systematically analyze introspective mechanisms in open-source LLMs and, for the first time, clearly disentangle two decoupled pathways: probability-matching-based reasoning and content-agnostic direct access. Our experiments reveal that models often fall back on high-frequency, concrete concepts (e.g., "apple") when confabulating guesses, whereas accurately identifying the injected semantic content requires substantially more tokens — thereby confirming both the existence and the limitations of the direct-access mechanism. These findings align with leading theories of introspection in psychology and philosophy and offer novel empirical support for cognitive architectures in artificial intelligence.

📝 Abstract
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating Lindsey et al.'s (2025) thought-injection detection paradigm in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomaly of the prompt) and (ii) direct access to internal states. The direct-access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for them, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
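The thought-injection setup described in the abstract can be caricatured with a toy numerical sketch. Everything below is illustrative, not taken from the paper: the concept vectors are random stand-ins for steering directions extracted from a real model, and the detector/identifier functions are minimal proxies for the two mechanisms (content-agnostic anomaly detection vs. identifying the injected content).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimensionality

# Hypothetical concept dictionary; names and vectors are illustrative only.
concepts = {name: rng.standard_normal(d) for name in ["apple", "ocean", "justice"]}

def inject(hidden, concept_vec, alpha=4.0):
    """Thought injection: add a scaled concept direction to a hidden state."""
    return hidden + alpha * concept_vec

def detect_anomaly(hidden, baseline_norm, threshold=2.0):
    """Content-agnostic check: did *something* unusual happen to the state?"""
    return abs(np.linalg.norm(hidden) - baseline_norm) > threshold

def identify_concept(hidden, baseline):
    """Content identification: which known concept best explains the shift?"""
    delta = hidden - baseline
    scores = {name: float(delta @ v) / (np.linalg.norm(delta) * np.linalg.norm(v))
              for name, v in concepts.items()}
    return max(scores, key=scores.get)

baseline = rng.standard_normal(d)
injected = inject(baseline, concepts["ocean"])

print(detect_anomaly(injected, np.linalg.norm(baseline)))  # anomaly noticed?
print(identify_concept(injected, baseline))                # which concept?
```

The point of the toy is the dissociation: the anomaly check needs no knowledge of which concept was injected, while naming the concept requires comparing against candidate contents — mirroring the paper's finding that models can notice an injection without reliably identifying it.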
Problem

Research questions and friction points this paper is trying to address.

introspection
direct access
inference
thought injection
internal states
Innovation

Methods, ideas, or system contributions that make the work stand out.

introspection
direct access
probability-matching
thought injection
content-agnostic
Harvey Lederman
Department of Philosophy, The University of Texas at Austin
Kyle Mahowald
UT Austin
computational linguistics · psycholinguistics · natural language processing · cognitive science