Language Models Fail to Introspect About Their Knowledge of Language

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work conflates model introspection (a model's ability to accurately report on its own internal probability distribution) with generalization performance, lacking a rigorous, knowledge-controlled metric for genuine self-access. Method: We propose a novel introspectiveness measure: the extent to which an LLM's metalinguistic responses (e.g., to "Is this sentence grammatical?") predict *its own* token-level probabilities, beyond the predictive baseline of structurally similar models with nearly identical internal knowledge. We systematically evaluate 21 open-weight LLMs on syntactic judgment and word-prediction tasks, integrating theoretical grounding in string probability, cross-model correlation analysis, and controlled ablation experiments. Contribution/Results: High task accuracy does not imply true introspection; model responses fail to specifically track their own internal probability distributions, and we find no empirical evidence for a privileged "self-access" capability. This work introduces the first knowledge-calibrated, quantitative introspection metric, challenging prevailing assumptions about LLM metacognition and establishing a methodological benchmark for future research.

📝 Abstract
There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.
Problem

Research questions and friction points this paper is trying to address.

Investigate whether LLMs can introspect about their internal linguistic knowledge.
Evaluate whether responses to metalinguistic prompts accurately reflect models' internal knowledge.
Propose a new measure for assessing LLMs' introspection capabilities.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically investigate introspection across 21 LLMs
Propose a new measure: how well a model's prompted responses predict its own string probabilities, beyond a model with nearly identical knowledge
Evaluate metalinguistic prompting against theoretically grounded internal knowledge (string probabilities)
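The proposed measure can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, inputs, and use of Pearson correlation are illustrative assumptions. The idea it captures is the one stated above: a model introspects only if its prompted metalinguistic judgments predict *its own* string probabilities better than they predict those of a knowledge-matched control model.

```python
import numpy as np

def introspection_score(meta_responses, own_logprob_diffs, other_logprob_diffs):
    """Hypothetical sketch of the introspection measure.

    meta_responses      : binary prompted judgments (e.g., 1 = model
                          answered "grammatical") for each test item
    own_logprob_diffs   : the same model's log-probability differences
                          for each item (e.g., grammatical minus
                          ungrammatical variant of a minimal pair)
    other_logprob_diffs : the analogous differences from a structurally
                          similar model with nearly identical knowledge
    """
    meta = np.asarray(meta_responses, dtype=float)
    # Correlation of prompted judgments with the model's OWN
    # probability comparisons...
    r_self = np.corrcoef(meta, np.asarray(own_logprob_diffs))[0, 1]
    # ...versus with the control model's probability comparisons.
    r_other = np.corrcoef(meta, np.asarray(other_logprob_diffs))[0, 1]
    # A clearly positive score would indicate privileged self-access;
    # the paper reports finding no such evidence.
    return r_self - r_other
```

On this framing, high task accuracy alone is uninformative: if both models rank the items similarly, a model's judgments correlate with the control's probabilities about as well as with its own, and the score stays near zero.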