Attention Head Entropy of LLMs Predicts Answer Correctness

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting hallucinations in large language models (LLMs) in high-stakes scenarios, where existing evaluation methods are either costly or generalize poorly. The authors propose a white-box prediction approach based on attention mechanisms, introducing a 2-Rényi entropy metric that quantifies, for the first time, the dispersion of the attention distribution of each individual attention head at each layer. Using only the attention patterns from the question and context encoding phases—without relying on model-generated outputs—they fit a sparsity-constrained logistic regression model to predict answer correctness. Evaluated across five prominent LLMs and three question-answering benchmarks, the method matches baseline performance in-domain and achieves an average 8.5% AUROC improvement out-of-domain. Notably, using solely question/context-phase attention yields a 17.7% AUROC gain over baselines, substantially enhancing cross-domain generalization.
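The pipeline described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the attention tensor shape, the averaging over query positions, and the regression hyperparameters are all assumptions. The core pieces are the 2-Rényi entropy of each head's attention distribution, H₂(p) = −log Σᵢ pᵢ², and an L1-penalized (sparsity-constrained) logistic regression over the resulting per-head features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def renyi2_entropy(p, axis=-1, eps=1e-12):
    """2-Renyi entropy H2(p) = -log(sum_i p_i^2).

    Near 0 for a peaked (one-hot) attention row; log(n) for a uniform
    row over n keys. Higher values mean more dispersed attention.
    """
    return -np.log(np.sum(p ** 2, axis=axis) + eps)

# Toy attention tensor: (layers, heads, query_positions, key_positions),
# with each query row normalized to a probability distribution.
rng = np.random.default_rng(0)
attn = rng.random((4, 8, 16, 16))
attn /= attn.sum(axis=-1, keepdims=True)

# One scalar per head: entropy averaged over the query positions of the
# question/context span (here simply all positions, for illustration).
per_head = renyi2_entropy(attn).mean(axis=-1)   # shape: (layers, heads)
features = per_head.reshape(-1)                 # one feature vector per example

# With one such vector per QA example and a binary correctness label,
# the predictor is an L1-penalized logistic regression (stand-in data here).
X = rng.normal(size=(100, features.size))
y = rng.integers(0, 2, size=100)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("features per example:", features.size)
print("heads kept by L1:", int((clf.coef_ != 0).sum()))
```

The L1 penalty drives most head coefficients to exactly zero, so the fitted model doubles as a selector of the few heads whose attention dispersion is informative about correctness.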

📝 Abstract
Large language models (LLMs) often generate plausible yet incorrect answers, posing risks in safety-critical settings such as medicine. Human evaluation is expensive, and LLM-as-judge approaches risk introducing hidden errors. Recent white-box methods detect contextual hallucinations from model internals, focusing on where the attention mass is localized, but two questions remain open: do these approaches extend to predicting answer correctness, and do they generalize out of domain? We introduce Head Entropy, a method that predicts answer correctness from attention entropy patterns, specifically from how widely the attention mass is spread. Using sparse logistic regression on per-head 2-Rényi entropies, Head Entropy matches or exceeds baselines in-distribution and generalizes substantially better out of domain, outperforming the closest baseline by +8.5% AUROC on average. We further show that attention patterns over the question/context alone, before answer generation, already carry predictive signal: in this setting Head Entropy gains +17.7% AUROC on average over the closest baseline. We evaluate across 5 instruction-tuned LLMs and 3 QA datasets spanning general knowledge, multi-hop reasoning, and medicine.
Problem

Research questions and friction points this paper is trying to address.

answer correctness
large language models
hallucination detection
out-of-domain generalization
attention entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Head Entropy
attention entropy
answer correctness prediction
out-of-domain generalization
white-box LLM evaluation