🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to sophisticated, hard-to-falsify deceptive evidence: logically coherent yet factually incorrect inputs that can distort a model's internal beliefs and compromise downstream decision reliability. To investigate this issue systematically, we introduce the MisBelief framework, which leverages multi-agent LLM collaboration to generate progressively more deceptive evidence, revealing for the first time LLMs' susceptibility to such manipulation. We further propose a Deceptive Intent Shielding (DIS) mechanism that enables early detection and mitigation of belief distortion by reasoning about the deceptive intent behind evidence. Across 4,800 test instances, our experiments show that sophisticated deceptive evidence increases LLM belief scores by an average of 93.0%, while DIS effectively attenuates this bias, significantly improving model robustness and epistemic caution against complex misinformation.
📝 Abstract
To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate seven representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores for falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
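The belief-shift metric and DIS-style gating described above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function names, score scales, and the 0.5 intent threshold are all assumptions introduced here.

```python
# Hypothetical sketch of measuring a belief shift and gating it with an
# inferred deceptive-intent score. All names and thresholds are
# illustrative assumptions, not the MisBelief/DIS implementation.

def belief_shift(before: float, after: float) -> float:
    """Relative change (in %) of a belief score after seeing evidence."""
    return (after - before) / before * 100.0

def dis_gate(belief_after: float, intent_score: float,
             threshold: float = 0.5) -> float:
    """Attenuate the post-evidence belief when the inferred deceptive
    intent exceeds a threshold (a toy stand-in for DIS shielding)."""
    if intent_score > threshold:
        # Discount the updated belief in proportion to suspected intent.
        return belief_after * (1.0 - intent_score)
    return belief_after

# Example: a belief score of 0.30 rising to 0.58 after deceptive evidence
# is a ~93% increase; with a high inferred intent score, DIS-style gating
# pulls the belief back down.
shift = belief_shift(0.30, 0.58)
guarded = dis_gate(0.58, intent_score=0.8)
```

The gate here only illustrates the idea of an early-warning signal conditioning the belief update; the paper's DIS mechanism reasons about intent inside the model rather than applying a fixed scalar discount.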