Interpreting and Mitigating Unwanted Uncertainty in LLMs

📅 2025-10-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from “answer flipping”: repeated queries cause previously correct outputs to flip to incorrect ones, undermining reliability—especially in high-stakes applications. This work identifies, via mechanistic interpretability, critical non-retrieval attention heads that aberrantly attend to misleading tokens under uncertainty, constituting the primary cause of flipping. We propose a head-level masking intervention that selectively suppresses these heads, effectively mitigating flipping without degrading text coherence or inducing overcorrection. To enable controlled evaluation, we construct flip-prone scenarios using a Needle-in-a-Haystack retrieval framework and Flip-style re-evaluation prompts, coupled with attention attribution for precise targeting. Experiments across multiple settings show up to a 15% reduction in flipping rate. This is the first study to causally link specific attention heads to LLM uncertainty stability, offering an interpretable, intervention-aware pathway toward trustworthy LLM design.

Technology Category

Application Category

📝 Abstract
Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when tested for downstream tasks, we observe trade-offs with flip behavior. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigating mechanisms causing LLMs to flip correct answers into incorrect ones
Identifying non-retrieval attention heads responsible for attending to misleading tokens
Developing masking techniques to reduce flip behavior without introducing incoherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Flip-style prompts to simulate answer-flipping scenarios
Identifies non-retrieval attention heads causing uncertainty
Masks problematic heads to reduce flip behavior by 15%
🔎 Similar Papers
No similar papers found.