🤖 AI Summary
To address the growing sophistication of LLM-generated disinformation and the inability of existing detection methods to keep pace with its dynamic evolution, this paper proposes a symbolic adversarial learning framework. It features co-evolving generative and detection agents engaged in structured debate: the former constructs logically coherent yet deceptive narratives, while the latter identifies factual and logical inconsistencies via symbolic reasoning. Crucially, the framework represents model weights, loss, and gradients as natural-language symbols, replacing conventional parameter updates with interpretable, evolvable symbolic inference. Combined with multilingual prompt optimization, the framework significantly enhances both attack and defense capabilities across Chinese and English datasets: generated samples reduce the accuracy of mainstream detectors by up to 53.4% (Chinese) and 34.2% (English), while the refined detector achieves up to a 7.7% improvement in identifying refined disinformation.
📝 Abstract
Rapid advances in LLMs heighten fake news risks by enabling the automatic generation of increasingly sophisticated misinformation. Previous detection methods, whether fine-tuned small models or LLM-based detectors, often struggle with its dynamically evolving nature. In this work, we propose a novel framework, the Symbolic Adversarial Learning Framework (SALF), which implements an adversarial training paradigm through an agent symbolic learning optimization process rather than numerical updates. In SALF, a generation agent crafts deceptive narratives while a detection agent identifies their logical and factual flaws through structured debate, and the two agents iteratively refine themselves through these adversarial interactions. Unlike traditional neural updates, we represent agents using agent symbolic learning: learnable weights are defined by agent prompts, and back-propagation and gradient descent are simulated by operating on natural-language representations of weights, loss, and gradients. Experiments on two multilingual benchmark datasets demonstrate SALF's effectiveness: it generates sophisticated fake news that degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average, and it refines detectors, improving detection of refined content by up to 7.7%. We hope our work inspires further exploration of more robust, adaptable fake news detection systems.
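The symbolic learning loop described in the abstract, with prompts as learnable weights and LLM-generated critiques playing the roles of loss and gradients, can be sketched conceptually in Python. Everything below (class and function names, the `llm` stub, the exact update phrasing) is a hypothetical illustration rather than the paper's implementation; a real system would replace the `llm` stub with an actual model call and richer debate prompts.

```python
from dataclasses import dataclass, field


def llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call (deterministic for illustration)."""
    return f"[LLM output for: {prompt[:40]}...]"


@dataclass
class SymbolicAgent:
    prompt: str                                   # the agent's "weights" in natural language
    history: list = field(default_factory=list)   # past prompt versions

    def forward(self, x: str) -> str:
        # Forward pass: apply the prompt ("weights") to an input.
        return llm(f"{self.prompt}\n\nInput: {x}")

    def backward(self, loss_text: str) -> str:
        # "Gradient": a textual critique of how the prompt led to the loss.
        return llm(f"Given the feedback '{loss_text}', suggest improvements to: {self.prompt}")

    def step(self, grad: str) -> None:
        # "Gradient descent": rewrite the prompt according to the textual gradient.
        self.history.append(self.prompt)
        self.prompt = llm(f"Rewrite the prompt '{self.prompt}' applying: {grad}")


def adversarial_round(generator: SymbolicAgent, detector: SymbolicAgent, topic: str) -> None:
    # One round of structured adversarial interaction.
    fake = generator.forward(topic)     # generator crafts a deceptive narrative
    verdict = detector.forward(fake)    # detector debates its factual/logical flaws
    # Each agent treats the opponent's output as its natural-language loss.
    generator.step(generator.backward(f"Detector verdict: {verdict}"))
    detector.step(detector.backward(f"Generator narrative: {fake}"))
```

Iterating `adversarial_round` co-evolves the two prompts, mirroring the generator/detector refinement the abstract describes, with no numerical parameter updates involved.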