🤖 AI Summary
Large language models (LLMs) frequently generate incorrect answers with high confidence, and determining when and how to effectively leverage externally retrieved context remains challenging. This paper proposes an early trustworthiness-auditing method based on internal model activations: a lightweight classifier is trained on the intermediate-layer activations of the first output token to predict both final answer correctness and the utility of retrieved context. A novel metric distinguishes correct, incorrect, and irrelevant contexts. The approach requires no additional prompting or fine-tuning, achieves roughly 75% accuracy in predicting answer correctness across six mainstream LLMs, and significantly outperforms prompt-engineering baselines at identifying context utility. Its core innovation is the first use of first-token activations as an interpretable, low-overhead proxy signal for trustworthiness, enabling real-time, fine-grained internal auditing of the LLM generation process.
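To make the probing setup concrete, here is a minimal sketch of the general technique: extract the hidden state at an intermediate layer for the position that emits the first output token, then fit a simple classifier on those activations. The model name, layer index, prompt handling, and choice of logistic regression are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of first-token activation probing, assuming a
# HuggingFace causal LM. Model name, layer index, and the logistic
# regression probe are illustrative assumptions, not the paper's
# exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice
LAYER = 16                               # hypothetical intermediate layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def first_token_activation(prompt: str) -> torch.Tensor:
    """Intermediate-layer hidden state at the position that emits the first output token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states is a tuple of (num_layers + 1) tensors of shape
    # [batch, seq_len, hidden_dim]; the last input position is what the
    # first generated token is conditioned on.
    return out.hidden_states[LAYER][0, -1, :]

def train_correctness_probe(prompts: list[str], correct: list[int]) -> LogisticRegression:
    """Fit a lightweight probe mapping activations to answer correctness labels."""
    X = torch.stack([first_token_activation(p) for p in prompts]).float().numpy()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, correct)
    return clf
```

Because the probe reads only the activation at the first output token, the audit can run before the rest of the answer is generated, which is what makes the auditing "early".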
📝 Abstract
Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remain challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model's activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish among them. Experiments on six different models reveal that a simple classifier trained on the intermediate-layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs. Our code is publicly available at https://github.com/jiarui-liu/LLM-Microscope.
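The abstract distinguishes correct, incorrect, and irrelevant context but does not spell out the internals-based metric here. One plausible instantiation, sketched below under that assumption, reuses `first_token_activation` and the trained probe `clf` from the sketch above and scores a retrieved passage by the shift it induces in predicted correctness; the prompt template and the difference-of-probabilities scoring are both hypothetical.

```python
# Hedged sketch of a context-utility score. The paper's actual metric is
# not given in the abstract; this assumed variant reuses the helper
# first_token_activation and the trained probe `clf` from the previous
# sketch, scoring a passage by how it shifts predicted correctness.
def context_utility(clf, question: str, context: str) -> float:
    """Positive if the context raises predicted correctness; negative if it lowers it."""
    def prob_correct(prompt: str) -> float:
        x = first_token_activation(prompt).float().numpy().reshape(1, -1)
        return clf.predict_proba(x)[0, 1]

    without_ctx = prob_correct(question)
    with_ctx = prob_correct(f"Context: {context}\n\nQuestion: {question}")
    return with_ctx - without_ctx
```

Under this scoring, a correct context would be expected to yield a clearly positive value, an incorrect context a negative one, and an irrelevant context a value near zero, mirroring the three context categories the paper distinguishes.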