🤖 AI Summary
Standard entropy-based uncertainty quantification (UQ) methods fail in retrieval-augmented generation (RAG) because activation of the model's induction heads can cause correct answers to be misclassified as highly uncertain. This work uncovers, for the first time, a "tug-of-war" effect between induction heads and entropy neurons in RAG systems and proposes a mechanism-driven, induction-aware entropy gating method that calibrates predictive entropy using interpretable internal contextual signals. Evaluated across four RAG benchmarks and six open-source large language models (ranging from 4B to 13B parameters), the proposed approach consistently matches or outperforms existing UQ techniques, substantially improving hallucination detection performance.
📄 Abstract
While retrieval-augmented generation (RAG) significantly improves the factual reliability of LLMs, it does not eliminate hallucinations, so robust uncertainty quantification (UQ) remains essential. In this paper, we reveal that standard entropy-based UQ methods often fail in RAG settings due to a mechanistic paradox. Context utilization gives rise to an internal "tug-of-war": while induction heads promote grounded responses by copying the correct answer from the retrieved context, they collaterally trigger the previously documented "entropy neurons". This interaction inflates predictive entropy, causing the model to signal false uncertainty on accurate outputs. To address this, we propose INTRYGUE (Induction-Aware Entropy Gating for Uncertainty Estimation), a mechanistically grounded method that gates predictive entropy based on the activation patterns of induction heads. Evaluated across four RAG benchmarks and six open-source LLMs (4B to 13B parameters), INTRYGUE consistently matches or outperforms a wide range of UQ baselines. Our findings demonstrate that hallucination detection in RAG benefits from combining predictive uncertainty with interpretable, internal signals of context utilization.
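The core idea can be illustrated with a minimal sketch. The abstract does not specify INTRYGUE's exact gating function, so the names (`gated_entropy`, `alpha`) and the sigmoidal-style gate below are hypothetical: the sketch simply shows how a scalar induction-head activation score could downweight raw predictive entropy, so that high entropy co-occurring with strong context copying is no longer read as genuine uncertainty.

```python
import numpy as np

def predictive_entropy(probs):
    # Shannon entropy of a next-token distribution (nats).
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

def gated_entropy(probs, induction_scores, alpha=1.0):
    # Hypothetical gate: when induction heads fire strongly (the model is
    # copying the answer from retrieved context), shrink the entropy signal,
    # since the "tug-of-war" with entropy neurons inflates it artificially.
    # `induction_scores` stands in for per-head activation strengths;
    # `alpha` controls how aggressively the gate closes.
    h = predictive_entropy(probs)
    gate = 1.0 / (1.0 + alpha * float(np.mean(induction_scores)))
    return h * gate
```

With no induction activity the gate is 1 and the raw entropy passes through unchanged; as the mean activation grows, the uncertainty score shrinks, which is the qualitative behavior the paper's gating mechanism targets.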