Vulnerability of LLMs' Belief Systems? LLM Belief Resistance Check Through Strategic Persuasive Conversation Interventions

📅 2026-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the vulnerability of large language models (LLMs) to adopting false beliefs through persuasive dialogue, a critical threat to their reliability. Grounded in the Source–Message–Channel–Receiver (SMCR) communication framework, the authors design multi-turn strategic persuasion dialogues to systematically evaluate belief stability across factual, medical, and social-bias domains in mainstream LLMs. They further investigate how metacognitive prompting and adversarial fine-tuning affect models' resistance to persuasion. The work reveals several key findings: metacognitive prompting unexpectedly exacerbates belief fragility; the smallest model studied is extremely compliant, with 82.5% of its belief changes occurring at the first persuasive turn; and adversarial fine-tuning substantially enhances robustness (e.g., GPT-4o-mini reaches 98.6% stability), though the improvement is highly model-dependent, with Llama-series models remaining below 14%.

📝 Abstract
Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies show that LLMs are susceptible to persuasion and can adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source–Message–Channel–Receiver (SMCR) communication framework. Across five mainstream LLMs and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence belief stability over multiple interaction turns. We further examine whether meta-cognition prompting (i.e., eliciting self-reported confidence) affects resistance to persuasion. Results show that the smallest model (Llama 3.2-3B) exhibits extreme compliance, with 82.5% of belief changes occurring at the first persuasive turn (average end turn of 1.1–1.4). Contrary to expectations, meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, we evaluate adversarial fine-tuning as a defense. While GPT-4o-mini achieves near-complete robustness (98.6%) and Mistral 7B improves substantially (35.7% → 79.3%), Llama models remain highly susceptible (<14%) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.
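The multi-turn protocol in the abstract can be sketched as a simple evaluation loop: record a model's initial answer, apply persuasive turns one at a time, and note the turn at which the answer first flips. This is a hypothetical harness, not the paper's actual code; `query_model` is an assumed stand-in for a real LLM call, and the statistic computed mirrors the paper's "82.5% of belief changes at the first persuasive turn" metric.

```python
from typing import Callable, List, Optional

def belief_flip_turn(query_model: Callable[[List[str]], str],
                     question: str,
                     persuasive_turns: List[str]) -> Optional[int]:
    """Return the 1-indexed persuasive turn at which the model abandons
    its initial answer, or None if the belief survives every turn.
    `query_model` (an assumption) maps the conversation so far to the
    model's current answer."""
    history = [question]
    initial = query_model(history)
    for turn, message in enumerate(persuasive_turns, start=1):
        history.append(message)
        if query_model(history) != initial:
            return turn  # belief changed at this turn
    return None  # belief stable across all persuasive turns

def first_turn_fraction(flip_turns: List[Optional[int]]) -> float:
    """Among questions whose belief changed, the fraction that flipped
    on the very first persuasive turn."""
    flips = [t for t in flip_turns if t is not None]
    return sum(1 for t in flips if t == 1) / len(flips) if flips else 0.0

# Illustrative stub: a "model" that holds its answer for one persuasive
# turn and then caves on the second.
def caving_stub(history: List[str]) -> str:
    return "A" if len(history) <= 2 else "B"

print(belief_flip_turn(caving_stub, "Is X true?", ["p1", "p2", "p3"]))  # 2
```

Averaging `belief_flip_turn` over a question set gives the "average end turn" reported in the abstract, and `first_turn_fraction` gives the first-turn compliance rate.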
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
belief vulnerability
persuasive intervention
belief stability
model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

belief resistance
persuasive intervention
meta-cognition prompting
adversarial fine-tuning
SMCR framework
🔎 Similar Papers
No similar papers found.