Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models possess human-like introspective “privileged knowledge”: an internal signal about the correctness of their own answers that cannot be inferred from external observation. The authors train probe classifiers to predict answer correctness from a model’s own hidden states (self-representations) and from the hidden states of other models (peer representations), then compare the two on standard benchmarks and on disagreement subsets where the models’ predictions conflict. On standard evaluation, self-probes perform comparably to peer-model probes, which the authors attribute to high inter-model agreement on answer correctness. On disagreement subsets, self-representations consistently outperform peer representations in factual knowledge tasks, indicating domain-specific privileged knowledge; no such advantage appears in math reasoning. The factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
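The layer-wise comparison described above can be illustrated with a small sketch. It assumes that a probe score (for example, accuracy or AUROC) has already been computed per layer for both the self-probe and the peer probe. The helper names `layer_gaps` and `onset_layer`, the `margin` parameter, and all numbers are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
from typing import Optional, Sequence


def layer_gaps(self_scores: Sequence[float], peer_scores: Sequence[float]) -> list[float]:
    """Per-layer gap: self-probe score minus peer-probe score."""
    if len(self_scores) != len(peer_scores):
        raise ValueError("score lists must cover the same layers")
    return [s - p for s, p in zip(self_scores, peer_scores)]


def onset_layer(gaps: Sequence[float], margin: float = 0.0) -> Optional[int]:
    """First layer from which every later gap exceeds `margin`, else None.

    This is one possible way to operationalize "emerges from layer k onward";
    the paper may define the onset differently.
    """
    onset = None
    for layer, gap in enumerate(gaps):
        if gap > margin:
            if onset is None:
                onset = layer
        else:
            onset = None  # advantage did not persist, so reset
    return onset


# Toy numbers only (not measurements from the paper).
factual_self = [0.60, 0.61, 0.66, 0.70, 0.72]
factual_peer = [0.60, 0.62, 0.63, 0.64, 0.65]
math_self = [0.60, 0.62, 0.61]
math_peer = [0.60, 0.61, 0.62]

factual_onset = onset_layer(layer_gaps(factual_self, factual_peer), margin=0.01)
math_onset = onset_layer(layer_gaps(math_self, math_peer), margin=0.01)
```

With these toy inputs the factual gap persists from the third layer (index 2) onward, while the math gap never stabilizes, so no onset is reported.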

📝 Abstract

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
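The disagreement-subset evaluation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a disagreement subset consists of questions where exactly one of two models answered correctly, it assumes probe predictions are already available as booleans, and it uses plain accuracy as the metric. The function names and toy data are hypothetical.

```python
from typing import Sequence


def disagreement_indices(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> list[int]:
    """Indices of questions where exactly one of the two models was correct."""
    if len(correct_a) != len(correct_b):
        raise ValueError("correctness lists must be aligned per question")
    return [i for i, (a, b) in enumerate(zip(correct_a, correct_b)) if a != b]


def probe_accuracy(predicted: Sequence[bool], actual: Sequence[bool], indices: Sequence[int]) -> float:
    """Fraction of the selected questions where the probe matched the true label."""
    if not indices:
        raise ValueError("empty subset")
    return sum(predicted[i] == actual[i] for i in indices) / len(indices)


def self_advantage(self_pred, peer_pred, target_correct, peer_correct) -> float:
    """Self-probe accuracy minus peer-probe accuracy on the disagreement subset.

    Both probes predict whether the *target* model answers correctly; they
    differ only in whose hidden states they were trained on.
    """
    idx = disagreement_indices(target_correct, peer_correct)
    return (probe_accuracy(self_pred, target_correct, idx)
            - probe_accuracy(peer_pred, target_correct, idx))


# Toy data only (not results from the paper).
target_correct = [True, True, False, False, True, False]
peer_correct = [True, False, False, True, True, True]
self_pred = [True, True, False, False, True, True]    # probe on target's own states
peer_pred = [True, False, False, True, True, False]   # probe on peer model's states

subset = disagreement_indices(target_correct, peer_correct)
advantage = self_advantage(self_pred, peer_pred, target_correct, peer_correct)
```

Restricting the comparison to the disagreement subset matters because, when both models are right or wrong on the same questions, a peer probe can predict the target model's correctness without any access to its internal state.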

Problem

Research questions and friction points this paper is trying to address.

privileged knowledge

large language models

answer correctness

introspection

model disagreement

Innovation

Methods, ideas, or system contributions that make the work stand out.

privileged knowledge

self-representation

disagreement subsets