When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a persistent hallucination phenomenon in large language models (LLMs) induced by spurious correlations in training data, e.g., superficial statistical associations between surnames and nationalities, which lead models to generate incorrect answers with high confidence. Critically, this behavior is scale-invariant, resists standard confidence-based filtering and internal-state probing, and persists even after refusal fine-tuning. Method: synthetic controlled experiments are combined with theoretical analysis to systematically evaluate mainstream open- and closed-source models, including GPT-5, under spurious-correlation scenarios. Contribution/Results: the analysis reveals, for the first time, how spurious correlations mechanistically undermine the reliability of both model confidence and internal representations; the resulting hallucinations are robust and generalize across models. Consequently, the paper argues that hallucination detection must shift toward statistically aware paradigms explicitly designed to mitigate bias-induced failures.
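To make the mechanism concrete, here is a minimal synthetic sketch in the spirit of the paper's controlled experiments (not the authors' code; the "-ov" suffix feature, the 95/5 ratio, and the nationality label are invented for illustration). A model fit on data where a surname suffix spuriously predicts nationality assigns high confidence to the wrong answer for any exception entity:

```python
# Minimal sketch (not the paper's code): a toy illustration of how a
# spurious surname -> nationality correlation yields confidently wrong
# predictions. Features, labels, and ratios here are invented for the demo.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Feature: does the surname end in "-ov"? In this synthetic world, 95% of
# "-ov" surnames in the training data are labeled "Russian" (label 1),
# but a minority are not (label 0) -- the spurious correlation.
n = 10_000
x = rng.integers(0, 2, size=n)              # surname has the "-ov" suffix
noise = rng.random(n)
y = np.where(x == 1, (noise < 0.95).astype(int), (noise < 0.05).astype(int))

clf = LogisticRegression().fit(x.reshape(-1, 1), y)

# An "exception" entity: surname ends in "-ov" but the true nationality is
# NOT Russian. The model still predicts label 1 with ~95% confidence, so a
# confidence threshold would happily keep this wrong answer.
p_wrong = clf.predict_proba([[1]])[0, 1]
print(f"P(predicted Russian | '-ov' surname) = {p_wrong:.2f}")  # ~0.95
```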

📝 Abstract
Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.
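As a concrete picture of the first detection family named in the abstract, the sketch below implements a generic confidence-based filter (accept an answer if its mean token log-probability clears a threshold). The threshold, the per-answer log-probabilities, and the example answers are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of a generic confidence-based hallucination filter, the kind
# of baseline the paper shows to fail. All numbers below are invented.
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    mean_token_logprob: float  # mean token log-probability from the LLM
    correct: bool              # ground truth, for the demo only

def confidence_filter(ans: Answer, threshold: float = -0.5) -> bool:
    """Accept the answer iff the model's mean token log-prob clears the bar."""
    return ans.mean_token_logprob >= threshold

answers = [
    # A genuinely known fact: high confidence, correct -> accepted (good).
    Answer("Marie Curie was Polish-French.", -0.10, correct=True),
    # A spurious-correlation hallucination: the surname pattern makes the
    # wrong completion statistically dominant, so confidence is just as
    # high -- the filter accepts a wrong answer (the failure mode at issue).
    Answer("(fictional) Ana Petrov is Russian.", -0.12, correct=False),
    # Ordinary uncertainty: low confidence -> rejected (works as intended).
    Answer("The 1899 mayor of Ruritania was ...", -2.30, correct=False),
]

for a in answers:
    kept = confidence_filter(a)
    print(f"kept={kept!s:5} correct={a.correct!s:5} {a.text}")
```

Because the filter keys on exactly the statistic that the spurious correlation inflates, adjusting the threshold only trades off ordinary low-confidence errors against the same blind spot.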
Problem

Research questions and friction points this paper is trying to address.

Spurious correlations cause undetectable hallucinations in LLMs
Current detection methods fail against statistically biased outputs
New approaches needed for correlation-induced hallucination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies a class of hallucinations induced by spurious correlations in training data
Stress-tests confidence-based and probing detectors with controlled synthetic experiments (see the probing sketch after this list)
Argues for new, statistically aware approaches to bias-induced hallucination detection
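The sketch referenced in the list above illustrates the second detection family, inner-state probing, under one loud assumption: following the paper's finding, the synthetic "activations" of spurious-correlation hallucinations are generated to look like truthful ones, so the demo encodes the claimed failure mechanism by construction rather than demonstrating it empirically. All dimensions and distributions are invented:

```python
# Minimal sketch of inner-state probing: fit a linear probe on hidden
# activations to predict whether an answer is hallucinated. The activations
# are synthetic; the "spurious looks truthful" behavior is assumed, matching
# the failure mode the paper reports, not derived from a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

def fake_hidden_state(truthful: bool, spurious: bool) -> np.ndarray:
    """Synthetic activations: ordinary hallucinations shift the mean, but
    spurious-correlation answers mimic truthful ones internally, because
    the model 'believes' the statistically dominant completion."""
    base = rng.normal(0.0, 1.0, d)
    if not truthful and not spurious:
        base[:8] += 2.0  # a detectable signature for ordinary errors
    return base

# Train the probe on truthful vs. ordinary-hallucination examples.
X = np.stack([fake_hidden_state(t, False) for t in [True, False] * 500])
y = np.array([1, 0] * 500)  # 1 = truthful
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Evaluate on spurious-correlation hallucinations: internally they look
# truthful, so the probe waves them through -- detection fails.
X_spur = np.stack([fake_hidden_state(False, True) for _ in range(500)])
print("fraction flagged truthful:", probe.predict(X_spur).mean())  # ~1.0
```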
👥 Authors
Shaowen Wang
Professor, University of Illinois Urbana-Champaign
CyberGIS, Geospatial Data Science, Spatial AI, Spatial Analysis, Sustainability
Yiqi Dong
Institute for Interdisciplinary Information Sciences, Tsinghua University
Ruinian Chang
Institute for Interdisciplinary Information Sciences, Tsinghua University
Tansheng Zhu
Institute for Interdisciplinary Information Sciences, Tsinghua University
Machine Learning Theory
Yuebo Sun
Institute for Interdisciplinary Information Sciences, Tsinghua University
Kaifeng Lyu
Tsinghua University
Jian Li
Institute for Interdisciplinary Information Sciences, Tsinghua University