🤖 AI Summary
This study investigates how large language models (LLMs) achieve introspection, specifically whether they detect "thought injections" through reasoning or via direct access to internal states. Reproducing the thought-injection detection paradigm, we systematically analyze introspective mechanisms in open-source LLMs and, for the first time, clearly disentangle two decoupled pathways: probability-matching-based reasoning and content-agnostic direct access. Our experiments reveal that models often fall back on high-frequency concrete concepts (e.g., "apple") when confabulating guesses, whereas accurately identifying the injected semantic content requires substantially more tokens, confirming both the existence and the limitations of the direct-access mechanism. These findings align with dominant theories of introspection in psychology and philosophy and offer novel empirical support for cognitive architectures in artificial intelligence.
📝 Abstract
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating the thought-injection detection paradigm of Lindsey et al. (2025) in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomalousness of the prompt) and (ii) direct access to internal states. The direct-access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for these models, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.