Feeling the Strength but Not the Source: Partial Introspection in LLMs

📅 2025-12-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates large language models’ (LLMs’) capacity for self-perception—specifically, their ability to reliably detect, identify, and quantify conceptually injected activations within their own internal representations. Method: Using Meta-Llama-3.1-8B-Instruct, we employ activation injection, multi-round prompt engineering, and systematic prompt robustness evaluation to probe introspective capabilities. Contribution/Results: We report the first empirical evidence of “partial introspection” in LLMs: while models fail to robustly recognize the semantic identity of injected concepts (naming accuracy: 20%, replicating Anthropic’s findings), they reliably classify their relative strength (70% accuracy, significantly exceeding the 25% random baseline). This capability is highly prompt-sensitive and narrowly domain-specific, indicating that LLMs possess limited yet quantifiable intrinsic representational awareness. Our findings provide novel empirical grounding for advancing model transparency and interpretability research.

📝 Abstract
Recent work from Anthropic claims that frontier models can sometimes detect and name injected "concepts" represented as activation directions. We test the robustness of these claims. First, we reproduce Anthropic's multi-turn "emergent introspection" result on Meta-Llama-3.1-8B-Instruct: the model identifies and names the injected concept 20 percent of the time under Anthropic's original pipeline, exactly matching their reported numbers and showing that introspection is not exclusive to very large or capable models. Second, we systematically vary the inference prompt and find that introspection is fragile: performance collapses on closely related tasks, such as multiple-choice identification of the injected concept or binary discrimination, under varied prompts, of whether a concept was injected at all. Third, we identify a contrasting regime of partial introspection: the same model can reliably classify the coefficient strength of a normalized injected concept vector (as weak / moderate / strong / very strong) with up to 70 percent accuracy, far above the 25 percent chance baseline. Together, these results provide further evidence for Anthropic's claim that language models effectively compute a function of their baseline internal representations during introspection; however, these self-reports about those representations are narrow and prompt-sensitive. Our code is available at https://github.com/elyhahami18/CS2881-Introspection.
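The injection step the abstract describes, adding a coefficient-scaled, unit-normalized concept vector to a model's internal activations, can be sketched in a few lines. This is a minimal illustration of the arithmetic only: the vector values and coefficient below are made up, and the actual experiments apply this inside Meta-Llama-3.1-8B-Instruct's layers rather than to a bare list of floats.

```python
import math

def inject_concept(hidden, concept, coeff):
    """Add a coefficient-scaled, unit-normalized concept vector to a
    hidden state. Sketch of the activation-injection step from the
    abstract; in practice `hidden` is a transformer layer activation."""
    norm = math.sqrt(sum(c * c for c in concept))
    unit = [c / norm for c in concept]
    return [h + coeff * u for h, u in zip(hidden, unit)]

# Illustrative values: concept has norm 5, so its unit vector is
# [0.6, 0.0, 0.8]; a coefficient of 10 shifts the hidden state by
# 10 times that unit direction.
hidden = [0.5, -1.0, 2.0]
concept = [3.0, 0.0, 4.0]
injected = inject_concept(hidden, concept, coeff=10.0)
```

Normalizing the concept vector first is what makes the coefficient a meaningful "strength" knob, which is the quantity the paper's strength-classification experiments ask the model to report.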
Problem

Research questions and friction points this paper is trying to address.

Tests robustness of LLM introspection claims
Examines fragility of concept detection in models
Identifies partial introspection in strength classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reproducing introspection on smaller models
Testing fragility with varied inference prompts
Identifying partial introspection for concept strength
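The strength-classification protocol behind the 70 percent result can be sketched as scoring the model's four-way report (weak / moderate / strong / very strong) against the true coefficient bin. The bin edges below are assumptions for illustration; the page does not publish the actual ranges used.

```python
# Hypothetical coefficient ranges for the four strength labels;
# the paper uses these four labels but these exact edges are assumed.
BINS = [(4.0, "weak"), (8.0, "moderate"),
        (12.0, "strong"), (float("inf"), "very strong")]

def strength_label(coeff):
    """Map an injection coefficient to its ground-truth strength bin."""
    for upper, label in BINS:
        if coeff < upper:
            return label

def accuracy(predicted, gold):
    """Fraction of reports matching the true bin (4-way chance = 25%)."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

coeffs = [2.0, 6.0, 10.0, 20.0]
gold = [strength_label(c) for c in coeffs]
# A model that reliably tracks injection strength beats the 25% chance
# baseline; the paper reports up to 70% on this four-way task.
```

Framed this way, the task isolates magnitude awareness from semantic identification, which is exactly the split the paper's "partial introspection" finding rests on.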
Ely Hahami, Harvard University
Lavik Jain, Harvard University
Ishaan Sinha, Harvard University