🤖 AI Summary
This work addresses the insufficient robustness of LLM-generated text detectors. We propose a novel proxy-based decoding-layer attack paradigm that neither modifies the target detector nor alters the LLM’s parameters. Instead, it introduces a compact, human-like proxy model fine-tuned via Proximal Policy Optimization (PPO), which dynamically modulates the LLM’s output during decoding to generate highly human-like text that evades detection. The method supports both white-box and black-box settings and exhibits strong cross-model (e.g., Llama2-13B, Llama3-70B, Mixtral-8×7B), cross-domain, and cross-lingual generalization—while strictly preserving generation quality. On multiple benchmarks, mainstream detectors suffer an average AUROC reduction of 70.4%, with a maximum drop of 91.3% in cross-lingual settings. To our knowledge, this is the first approach achieving efficient, universal, and lossless detector evasion.
📝 Abstract
The advent of large language models (LLMs) has revolutionized the field of text generation, producing outputs that closely mimic human-like writing. Although academic and industrial institutions have developed detectors to prevent the malicious usage of LLM-generated texts, other research has doubt about the robustness of these systems. To stress test these detectors, we introduce a proxy-attack strategy that effortlessly compromises LLMs, causing them to produce outputs that align with human-written text and mislead detection systems. Our method attacks the source model by leveraging a reinforcement learning (RL) fine-tuned humanized small language model (SLM) in the decoding phase. Through an in-depth analysis, we demonstrate that our attack strategy is capable of generating responses that are indistinguishable to detectors, preventing them from differentiating between machine-generated and human-written text. We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8*7B in both white- and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in cross-language scenario, the drop reaches 91.3%. Despite our proxy-attack strategy successfully bypassing the detectors with such significant relative drops, we find that the generation quality of the attacked models remains preserved, even within a modest utility budget, when compared to the text produced by the original, unattacked source model.