🤖 AI Summary
Large language models (LLMs) suffer from "answer flipping": when re-prompted, a model can change a previously correct answer into an incorrect one, undermining reliability, especially in high-stakes applications. Using mechanistic interpretability, this work traces the behavior to a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts; retrieval heads, by contrast, are not the main culprits. The authors propose a head-level masking intervention that selectively suppresses these heads, mitigating flipping without degrading text coherence or inducing overcorrection. To enable controlled evaluation, they construct flip-prone scenarios by adapting the Needle-in-a-Haystack retrieval framework with Flip-style re-evaluation prompts. Experiments across multiple settings show up to a 15% reduction in flip rate, though downstream-task evaluation reveals trade-offs. The result is a causal link between specific attention heads and answer stability under uncertainty, offering an interpretable, intervention-based pathway toward more trustworthy LLM design.
📝 Abstract
Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty: a model may change a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for this instability. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, on downstream tasks we observe trade-offs between flip mitigation and task performance. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.
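The head-level masking described above can be illustrated with a toy sketch. This is a simplified, self-contained NumPy model of multi-head attention with a per-head suppression mask; the function and variable names are hypothetical, and the paper's actual intervention operates inside a trained transformer rather than this minimal module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads, head_mask=None):
    """Toy multi-head self-attention over (seq_len, d_model) inputs.

    head_mask: optional boolean array of shape (num_heads,). Heads
    whose entry is False contribute zeros to the output, mimicking
    the head-level suppression intervention described above.
    """
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    out = np.zeros_like(q)
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        qh, kh, vh = q[:, sl], k[:, sl], v[:, sl]
        # Scaled dot-product attention for this head.
        attn = softmax(qh @ kh.T / np.sqrt(d_head))
        head_out = attn @ vh
        if head_mask is not None and not head_mask[h]:
            head_out[:] = 0.0  # suppress the flip-inducing head
        out[:, sl] = head_out
    return out

# Example: suppress head 1 of 2; head 0's output is unchanged.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
full = multi_head_attention(x, x, x, num_heads=2)
masked = multi_head_attention(x, x, x, num_heads=2,
                              head_mask=np.array([True, False]))
```

In a real model one would implement the same idea with forward hooks that zero the chosen heads' output slices, leaving all other computation intact.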