🤖 AI Summary
Existing evidence-driven fake news detection systems are adversarially fragile across their multiple components, yet prior attacks lack effective methods to jointly target the retrieval and claim-evidence alignment modules. Method: We propose the first black-box adversarial attack framework tailored to multi-component architectures, leveraging dual-agent LLM collaboration to perform semantics-preserving, structured claim rewriting that simultaneously disrupts both evidence retrieval and claim-evidence alignment. We further introduce an iterative prompt optimization mechanism driven solely by binary decision feedback, overcoming the limitations of token-level perturbations. Contribution/Results: Evaluated on four real-world systems, including academic models and industrial APIs, our framework achieves an average attack success rate of 46.92% while rigorously preserving semantic equivalence and textual coherence.
📝 Abstract
Automated evidence-based misinformation detection systems, which evaluate the veracity of short claims against retrieved evidence, lack comprehensive analysis of their adversarial vulnerabilities. Existing black-box text-based adversarial attacks are ill-suited for these systems: they primarily rely on token-level substitutions guided by gradient- or logit-based optimization, which cannot fool the multi-component nature of such detectors. These systems incorporate both retrieval and claim-evidence comparison modules, requiring attacks to break the retrieval of evidence and/or mislead the comparison module into drawing incorrect inferences. We present CAMOUFLAGE, an iterative, LLM-driven approach that employs a two-agent system, a Prompt Optimization Agent and an Attacker Agent, to create adversarial claim rewritings that manipulate evidence retrieval and mislead claim-evidence comparison, effectively bypassing the system without altering the meaning of the claim. The Attacker Agent produces semantically equivalent rewrites that attempt to mislead detectors, while the Prompt Optimization Agent analyzes failed attack attempts and refines the Attacker's prompt to guide subsequent rewrites. This enables larger structural and stylistic transformations of the text than token-level substitutions allow, adapting the magnitude of changes based on previous outcomes. Unlike existing approaches, CAMOUFLAGE optimizes its attack solely on the basis of binary model decisions, eliminating the need for classifier logits or extensive querying. We evaluate CAMOUFLAGE on four systems, including two recent academic systems and two real-world APIs, achieving an average attack success rate of 46.92% while preserving textual coherence and semantic equivalence to the original claims.
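The dual-agent loop described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the function names (`attacker_agent`, `prompt_optimizer`, `detector`), the toy keyword-based detector, and the single rewrite strategy are all assumptions standing in for LLM calls and a real retrieval-plus-comparison pipeline. The key property it demonstrates is that the only feedback the loop consumes is the detector's binary verdict.

```python
def detector(claim):
    """Black-box detector stub: True means the claim is flagged as false.
    Toy rule (hypothetical): flag any claim containing the word 'hoax'."""
    return "hoax" in claim.lower()

def attacker_agent(claim, prompt):
    """Attacker Agent stub: produces a meaning-preserving rewrite of the
    claim under the current prompt. A real system would query an LLM."""
    if "lexical-shift" in prompt:
        # Structural/stylistic rewrite rather than token-level perturbation.
        return claim.replace("hoax", "fabrication")
    return claim  # initial, unrefined prompt yields the identity rewrite

def prompt_optimizer(prompt, failed_rewrite):
    """Prompt Optimization Agent stub: after a failed attempt, escalates
    the rewrite strategy encoded in the Attacker's prompt."""
    return prompt + " lexical-shift"

def camouflage_attack(claim, max_iters=5):
    """Iterate: rewrite, query detector, refine prompt on failure.
    Only the binary decision guides optimization (no logits)."""
    prompt = "rewrite preserving meaning"
    for _ in range(max_iters):
        rewrite = attacker_agent(claim, prompt)
        if not detector(rewrite):      # binary decision: attack succeeded
            return rewrite
        prompt = prompt_optimizer(prompt, rewrite)
    return None                        # attack failed within budget

result = camouflage_attack("The moon landing was a hoax.")
```

In this toy run the first rewrite is rejected, the optimizer refines the prompt, and the second rewrite evades the keyword detector; in the actual system both agents are LLMs and the detector is a full retrieval-and-comparison pipeline.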