🤖 AI Summary
Current AI-generated text detectors achieve over 90% accuracy on direct LLM outputs but fail catastrophically against iterative paraphrasing, that is, multi-round rewrites intended to preserve meaning. We identify the root cause as detector overreliance on superficial statistical features, which leaves them vulnerable to an intermediate “laundering” region where semantics are displaced while generation patterns persist. To address this, we propose PADBen, the first benchmark for evaluating detector robustness against two distinct paraphrase attacks: authorship obfuscation (paraphrasing human-authored text) and plagiarism evasion (paraphrasing LLM-generated text). PADBen introduces a five-type text taxonomy and five progressive detection tasks, and builds its adversarial samples with an intrinsic-mechanism-guided, multi-stage human-AI collaborative generation strategy. An evaluation of 11 state-of-the-art detectors reveals a critical asymmetry: detectors handle plagiarism evasion but fail under authorship obfuscation, exposing fundamental limitations in current detection paradigms.
📝 Abstract
While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively paraphrased content. We investigate why iteratively paraphrased text, which is itself AI-generated, evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns. This region gives rise to two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark to systematically evaluate detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks spanning sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors and find a critical asymmetry: detectors successfully handle plagiarism evasion but fail on authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. The code is available at https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.