Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the vulnerability of AI-generated text detectors to paraphrasing attacks and the limited robustness of existing evasion methods, this paper proposes a training-free, detector-guided adversarial rewriting framework. The method leverages an instruction-tuned large language model (e.g., Llama-3) accessed via black-box API calls to paraphrase AI-generated text under closed-loop feedback from a guidance detector, producing adversarial examples that evade neural-network-based, watermark-based, and zero-shot detection paradigms, including RADAR and Fast-DetectGPT. This "detector-guided" paradigm markedly improves cross-detector attack transferability and overcomes the susceptibility of conventional paraphrasing to counter-detection. Experiments show that, guided by OpenAI-RoBERTa-Large, the attack reduces the true positive rate at 1% false positive rate (T@1%F) by 87.88% on average across detectors, with a maximum reduction of 98.96%, while largely preserving textual naturalness and quality.

📝 Abstract
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to a simple paraphrasing attack (which, ironically, increases the true positive rate at 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT), adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors, including neural network-based, watermark-based, and zero-shot approaches, our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success, finding that our method can significantly reduce detection rates with, in most cases, only a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in light of increasingly sophisticated evasion techniques.
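The detector-guided loop described in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' code: `paraphrase` is a toy stand-in for the instruction-following LLM, and `detector_score` a stand-in for the guidance detector (e.g., OpenAI-RoBERTa-Large) returning the probability that a text is AI-generated. Names, rounds, and thresholds are assumptions.

```python
# Illustrative sketch of detector-guided adversarial paraphrasing.
# `paraphrase` and `detector_score` are toy stubs, not real components.
import random

def paraphrase(text: str, n_candidates: int = 4) -> list[str]:
    # Stand-in for an LLM paraphraser: returns n candidate rewrites.
    return [f"{text} [candidate {i}]" for i in range(n_candidates)]

def detector_score(text: str) -> float:
    # Stand-in for a detector: pseudo-probability that text is AI-generated.
    # Deterministic per input so repeated calls agree within one run.
    rng = random.Random(hash(text) % (2**32))
    return rng.random()

def adversarial_paraphrase(text: str, max_rounds: int = 3,
                           threshold: float = 0.2) -> str:
    """Greedy detector-guided rewriting: each round keeps the candidate
    the guidance detector scores as least AI-like, if it improves."""
    current = text
    for _ in range(max_rounds):
        candidates = paraphrase(current)
        best = min(candidates, key=detector_score)
        if detector_score(best) < detector_score(current):
            current = best  # accept only score-reducing rewrites
        if detector_score(current) < threshold:
            break  # stop once the text already looks "human" enough
    return current
```

Because rewrites are accepted only when the guidance detector's score drops, the final text can never score higher than the input under that detector; the paper's key finding is that these reductions also transfer to detectors not used for guidance.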
Problem

Research questions and friction points this paper is trying to address.

Evading AI-generated text detection via adversarial paraphrasing
Making evasion attacks transfer across diverse detector architectures
Balancing text quality and detection evasion effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free attack framework humanizes AI text
Uses off-the-shelf LLM guided by detector
Reduces detection rates significantly across systems
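The headline numbers above are stated in terms of T@1%F, the true positive rate at a 1% false positive rate. A minimal sketch of how such a metric is computed, using synthetic detector scores (an assumption for illustration; higher score = more AI-like):

```python
# Sketch of T@1%F: true positive rate at 1% false positive rate.
# The score distributions below are synthetic, for illustration only.
import numpy as np

def tpr_at_fpr(scores_human, scores_ai, target_fpr=0.01):
    # Pick the threshold that flags at most target_fpr of human texts,
    # then measure what fraction of AI texts exceed it.
    threshold = np.quantile(np.asarray(scores_human), 1.0 - target_fpr)
    return float(np.mean(np.asarray(scores_ai) > threshold))

rng = np.random.default_rng(0)
human = rng.normal(0.2, 0.1, 1000)  # detector scores on human text
ai = rng.normal(0.8, 0.1, 1000)     # detector scores on AI text
print(tpr_at_fpr(human, ai))
```

An attack "reducing T@1%F by 98.96%" means that, at the same strict 1% false-alarm operating point, the detector catches almost none of the attacked AI text it previously caught.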