Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of large language model (LLM)-generated text to detection by automated classifiers. We propose CoPA, a training-free contrastive paraphrasing attack that leverages black-box LLMs via instruction engineering and implicit word-distribution modeling. CoPA is the first method to explicitly model the machine-generated word distribution as a subtractable auxiliary signal and integrate it into a contrastive decoding mechanism during generation to suppress detectable “machine artifacts.” Evaluated against state-of-the-art detectors—including DetectGPT and Fast-DetectGPT—across diverse text domains, CoPA achieves significant improvements in attack success rate while incurring zero training overhead. Its core innovations are: (i) a training-free contrastive decoding paradigm that dynamically steers token selection away from detector-sensitive patterns, and (ii) an operationalizable, explicit modeling of machine-induced lexical distributions as manipulable signals.

📝 Abstract
The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged that purposely rewrite such texts to evade detection. Despite their success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy drops sharply when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during decoding, CoPA produces sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
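The subtraction described in the abstract can be illustrated with a minimal sketch of one contrastive decoding step. All names here (`alpha`, the toy logit vectors) are illustrative assumptions, not values or an API from the paper:

```python
import numpy as np

def contrastive_decode_step(human_logits, machine_logits, alpha=0.5):
    """One contrastive decoding step (sketch, not the paper's implementation).

    Subtracts a scaled machine-like logit vector from the human-like one
    before normalizing, so tokens disproportionately favored by the
    machine-like distribution are down-weighted. `alpha` is an assumed
    name for the contrast strength.
    """
    contrasted = human_logits - alpha * machine_logits
    # Softmax over the contrasted logits to obtain next-token probabilities.
    exp = np.exp(contrasted - contrasted.max())
    return exp / exp.sum()

# Toy vocabulary of 4 tokens: the machine-like distribution strongly
# favors token 0, so contrasting suppresses it.
human = np.array([2.0, 1.5, 0.5, 0.1])
machine = np.array([3.0, 0.5, 0.2, 0.1])
probs = contrastive_decode_step(human, machine, alpha=0.8)
print(probs.argmax())  # → 1: the machine-preferred token 0 is no longer chosen
```

In a real pipeline the two logit vectors would come from the same LLM queried under two different instructions, with the contrasted distribution sampled at every generation step.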
Problem

Research questions and friction points this paper is trying to address.

Detecting LLM-generated texts to prevent misuse like plagiarism
Existing paraphrase attacks require heavy training resources
Overcoming inherent machine-like biases in LLM outputs for evasion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free method using off-the-shelf LLMs
Contrastive human-like and machine-like distributions
Instruction crafting for human-like text generation
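The bullets above pair instruction crafting with the contrastive distributions; a hedged sketch of how two such instructions might be paired for one input follows. The prompt wording is purely illustrative and not reproduced from the paper:

```python
# Hypothetical prompt pair for eliciting contrasting word distributions
# from the same off-the-shelf LLM; CoPA's actual instructions differ.
HUMAN_LIKE_PROMPT = (
    "Rewrite the following text so it reads like natural, informal "
    "human writing, with varied sentence lengths:\n{text}"
)
MACHINE_LIKE_PROMPT = (
    "Rewrite the following text in a formal, uniform, assistant-like "
    "style:\n{text}"
)

def build_prompts(text: str) -> tuple[str, str]:
    """Return the (human-like, machine-like) prompts for one input text."""
    return (HUMAN_LIKE_PROMPT.format(text=text),
            MACHINE_LIKE_PROMPT.format(text=text))

human_p, machine_p = build_prompts("LLMs can paraphrase text.")
print(human_p.endswith("LLMs can paraphrase text."))  # → True
```

Because the method is training-free, the only attack-specific components are prompt pairs like these and the decoding-time subtraction; no paraphraser is fine-tuned.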
👥 Authors
Hao Fang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Jiawei Kong
Tsinghua University
Trustworthy AI
Tianqu Zhuang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yixiang Qiu
Tsinghua Shenzhen International Graduate School
Trustworthy AI, Computer Vision, Deep Learning
Kuofeng Gao
Tsinghua University
Large Language Models, Trustworthy AI, Backdoor Learning
Bin Chen
Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory
Shu-Tao Xia
SIGS, Tsinghua University
Coding and Information Theory, Machine Learning, Computer Vision, AI Security
Yaowei Wang
The Hong Kong Polytechnic University
Min Zhang
Harbin Institute of Technology, Shenzhen