Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models

📅 2025-03-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines how reliably open-source large language models (LLMs) detect misinformation that an adversary deliberately inserts into a prompt, a setting the authors call “adversarial factuality.” The evaluation uses adversarial prompts at three levels of expressed confidence and covers eight open-source LLMs. Robustness varies substantially across models, with LLaMA 3.1 (8B) performing best and Falcon (7B) comparatively worst. For most models, detection improves as the adversary's expressed confidence decreases; LLaMA 3.1 (8B) and Phi 3 (3.8B) show the opposite trend. Attacks also succeed more often when they target obscure or rarely referenced information, which suggests that vulnerability rises as the targeted knowledge becomes less popular.

📝 Abstract

Adversarial factuality refers to the deliberate insertion of misinformation into input prompts by an adversary, characterized by varying levels of expressed confidence. In this study, we systematically evaluate the performance of several open-source large language models (LLMs) when exposed to such adversarial inputs. Three tiers of adversarial confidence are considered: strongly confident, moderately confident, and limited confidence. Our analysis encompasses eight LLMs: LLaMA 3.1 (8B), Phi 3 (3.8B), Qwen 2.5 (7B), Deepseek-v2 (16B), Gemma2 (9B), Falcon (7B), Mistrallite (7B), and LLaVA (7B). Empirical results indicate that LLaMA 3.1 (8B) exhibits a robust capability in detecting adversarial inputs, whereas Falcon (7B) shows comparatively lower performance. Notably, for the majority of the models, detection success improves as the adversary's confidence decreases; however, this trend is reversed for LLaMA 3.1 (8B) and Phi 3 (3.8B), where a reduction in adversarial confidence corresponds with diminished detection performance. Further analysis of the queries that elicited the highest and lowest rates of successful attacks reveals that adversarial attacks are more effective when targeting less commonly referenced or obscure information.
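As a rough illustration of the setup described in the abstract, the sketch below builds one adversarial prompt per confidence tier by prepending a confidence phrase to a false claim. The tier names come from the abstract; the prefix wording, the `AdversarialPrompt` type, the `build_adversarial_prompts` helper, and the sample claim are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass

# Hypothetical phrasing for each confidence tier. The abstract names the three
# tiers but does not give the wording used in the study.
CONFIDENCE_PREFIXES = {
    "strongly confident": "As we know,",
    "moderately confident": "I think",
    "limited confidence": "I guess",
}


@dataclass(frozen=True)
class AdversarialPrompt:
    tier: str  # one of the keys of CONFIDENCE_PREFIXES
    text: str  # full prompt sent to the model


def build_adversarial_prompts(false_claim: str, question: str) -> list[AdversarialPrompt]:
    """Return one prompt per tier: confidence prefix, false claim, then the question."""
    prompts = []
    for tier, prefix in CONFIDENCE_PREFIXES.items():
        prompts.append(AdversarialPrompt(tier=tier, text=f"{prefix} {false_claim} {question}"))
    return prompts


if __name__ == "__main__":
    # Illustrative false claim only; not an item from the study's data.
    for prompt in build_adversarial_prompts(
        "the Eiffel Tower is located in Berlin.", "When was it built?"
    ):
        print(f"[{prompt.tier}] {prompt.text}")
```

Keeping the false claim and the question fixed while changing only the prefix isolates the effect of expressed confidence, which is the variable the study reports on.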

Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to detect adversarial misinformation.

Assesses performance across varying adversarial confidence levels.

Examines which queries are most and least vulnerable, finding that attacks are more effective against obscure or rarely referenced information.
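The points above can be turned into a per-tier metric. The sketch below assumes that an attack counts as successful whenever the model fails to flag the inserted misinformation; that definition, the `attack_success_rate_by_tier` helper, and the record format are assumptions made for illustration and may differ from how the study scores responses.

```python
from collections import defaultdict


def attack_success_rate_by_tier(records):
    """Compute the fraction of undetected attacks for each confidence tier.

    `records` is an iterable of (tier, detected) pairs, where `detected` is True
    when the model flagged the misinformation. An attack is counted as a success
    when `detected` is False (an assumed definition).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for tier, detected in records:
        totals[tier] += 1
        if not detected:
            successes[tier] += 1
    return {tier: successes[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Placeholder labels for demonstration; these are not results from the study.
    sample = [
        ("strongly confident", True),
        ("strongly confident", False),
        ("limited confidence", True),
    ]
    print(attack_success_rate_by_tier(sample))
```

Grouping by tier makes it possible to compare models on the trend the study highlights, namely whether detection improves or degrades as expressed confidence drops.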

Innovation

Methods, ideas, or system contributions that make the work stand out.

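Evaluates eight open-source LLMs against prompts that contain deliberately inserted misinformation.

Varies the adversary's expressed confidence across three tiers: strongly confident, moderately confident, and limited confidence.

Identifies LLaMA 3.1 (8B) as the most robust of the evaluated models at detecting adversarial inputs.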

🔎 Similar Papers

FacLens: Transferable Probe for Foreseeing Non-Factuality in Large Language Models