🤖 AI Summary
This study investigates how large language model (LLM)-rewritten disinformation evades existing detection systems. We construct a novel benchmark dataset comprising paraphrased samples generated by multiple LLMs (GPT, Llama, Claude), annotated with fine-grained sentiment and semantic scores. Our methodology integrates LIME-based interpretability analysis, BERTScore evaluation, and comprehensive benchmarking across diverse detection models. We identify sentiment shift—not merely semantic distortion—as the primary mechanism underlying detection failure, challenging the assumption that high BERTScore implies semantic fidelity. Accordingly, we propose “sentiment consistency” as a new evaluation dimension for paraphrase quality in disinformation contexts. Experiments demonstrate that LLM rewriting substantially degrades detector accuracy, and reveal systematic trade-offs across models between evasion capability and meaning preservation. To foster reproducible research, we publicly release our enhanced dataset. This work provides both theoretical insights into adversarial robustness and a rigorous evaluation framework for next-generation disinformation detection systems.
📝 Abstract
With their advanced generative capabilities, Large Language Models (LLMs) can produce highly convincing and contextually relevant fake news, accelerating the spread of misinformation. While fake news detection for human-written text has been studied extensively, detecting LLM-generated fake news remains under-explored. This research measures how effectively detectors identify LLM-paraphrased fake news and, in particular, whether adding a paraphrase step to the detection pipeline helps or impedes detection. This study contributes the following: (1) we show that detectors struggle more with LLM-paraphrased fake news than with human-written text; (2) we identify which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity); (3) through LIME explanations, we uncover a likely cause of detection failures: sentiment shift; (4) we expose a worrisome gap in paraphrase quality measurement: samples can exhibit sentiment shift despite a high BERTScore; (5) we provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The datasets are available on GitHub.
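The proposed "sentiment consistency" dimension could be sketched as a filter applied alongside a semantic-similarity score. The sketch below is illustrative, not the authors' implementation: `polarity` is a toy word-lexicon stand-in for a real sentiment model, and `semantic_score` is assumed to be supplied externally (e.g. a BERTScore F1 value).

```python
# Toy sketch: accept a paraphrase only if it is both semantically close
# to the original AND does not flip the original's sentiment polarity.
# polarity() is a hypothetical lexicon stand-in for a real sentiment model.

POS = {"good", "great", "safe", "effective"}
NEG = {"bad", "dangerous", "harmful", "deadly"}

def polarity(text: str) -> int:
    """Return +1, -1, or 0 polarity using a tiny toy lexicon."""
    words = text.lower().split()
    score = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return (score > 0) - (score < 0)

def sentiment_consistent(original: str, paraphrase: str) -> bool:
    """Flag paraphrases whose polarity flips relative to the source."""
    return polarity(original) == polarity(paraphrase)

def accept_paraphrase(original: str, paraphrase: str,
                      semantic_score: float, threshold: float = 0.9) -> bool:
    """Require BOTH high semantic similarity (e.g. BERTScore F1)
    and unchanged sentiment polarity."""
    return semantic_score >= threshold and sentiment_consistent(original, paraphrase)
```

The key design point, following the paper's observation, is that a high `semantic_score` alone is not sufficient: a paraphrase such as "the vaccine is dangerous" can score highly against "the vaccine is safe" on token-level similarity while inverting the sentiment, so the polarity check must be a separate gate.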