AuthorMist: Evading AI Text Detectors with Reinforcement Learning

📅 2025-03-10
🤖 AI Summary
This study addresses the risk that text written with AI assistance is unfairly flagged by automatic detectors, which the authors argue threatens author privacy and freedom. It proposes AuthorMist, a reinforcement learning-based system that rewrites AI-generated text into human-like writing. The method uses an "API-as-reward" mechanism in which the outputs of black-box detectors (e.g., GPTZero, WinstonAI, Originality.ai) serve as reward signals. A 3-billion-parameter language model is fine-tuned with Group Relative Policy Optimization (GRPO) to paraphrase text so that it evades detection while preserving the original meaning. Experiments report attack success rates of 78.6% to 96.2% against individual detectors with semantic similarity above 0.94, significantly outperforming baseline paraphrasing methods. These results highlight limitations in current AI text detectors.

📝 Abstract
In the age of powerful AI-generated text, automatic detectors have emerged to identify machine-written content. This poses a threat to author privacy and freedom, as text authored with AI assistance may be unfairly flagged. We propose AuthorMist, a novel reinforcement learning-based system to transform AI-generated text into human-like writing. AuthorMist leverages a 3-billion-parameter language model as a backbone, fine-tuned with Group Relative Policy Optimization (GRPO) to paraphrase text in a way that evades AI detectors. Our framework establishes a generic approach where external detector APIs (GPTZero, WinstonAI, Originality.ai, etc.) serve as reward functions within the reinforcement learning loop, enabling the model to systematically learn outputs that these detectors are less likely to classify as AI-generated. This API-as-reward methodology can be applied broadly to optimize text against any detector with an accessible interface. Experiments on multiple datasets and detectors demonstrate that AuthorMist effectively reduces the detectability of AI-generated text while preserving the original meaning. Our evaluation shows attack success rates ranging from 78.6% to 96.2% against individual detectors, significantly outperforming baseline paraphrasing methods. AuthorMist maintains high semantic similarity (above 0.94) with the original text while successfully evading detection. These results highlight limitations in current AI text detection technologies and raise questions about the sustainability of the detection-evasion arms race.
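
The API-as-reward idea described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `Detector` interface that maps text to an estimated probability of being AI-generated, and it averages scores across detectors; neither the interface nor the averaging is taken from the paper, and `stub_detector` is a placeholder rather than a call to any real service.

```python
from typing import Callable, List

# Hypothetical detector interface (an assumption for illustration): maps text
# to an estimated probability in [0, 1] that the text is AI-generated.
Detector = Callable[[str], float]


def detector_reward(text: str, detectors: List[Detector]) -> float:
    """Return a reward that is high when detectors assign a low AI probability.

    Averaging over detectors is an illustrative choice, not the paper's stated
    reward definition.
    """
    if not detectors:
        raise ValueError("at least one detector is required")
    # Clamp each score to [0, 1] in case a detector returns an out-of-range value.
    scores = [min(1.0, max(0.0, d(text))) for d in detectors]
    mean_ai_prob = sum(scores) / len(scores)
    return 1.0 - mean_ai_prob


def stub_detector(text: str) -> float:
    """Placeholder standing in for a remote detector call; not a real detector."""
    return 0.9 if "delve" in text.lower() else 0.2


if __name__ == "__main__":
    for candidate in ["We delve into the topic.", "Here is what we found."]:
        print(candidate, detector_reward(candidate, [stub_detector]))
```

In a training loop, a function of this shape would score each sampled paraphrase, and the policy update would favor paraphrases with higher reward.
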
Problem

Research questions and friction points this paper is trying to address.

Evading AI text detectors using reinforcement learning
Transforming AI-generated text into human-like writing
Preserving semantic similarity while reducing detectability
Innovation

Methods, ideas, or system contributions that make the work stand out.

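Reinforcement learning is used to rewrite AI-generated text into human-like writing.
Group Relative Policy Optimization (GRPO) fine-tunes a 3-billion-parameter language model as the paraphraser.
API-as-reward uses external detector APIs as reward functions, so the same approach applies to multiple AI detectors.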
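
As a rough illustration of the "group relative" part of Group Relative Policy Optimization, the sketch below normalizes the rewards of several outputs sampled for the same input against the group's own mean and standard deviation. This follows the commonly described form of the algorithm; the function name, the `eps` term, and the example rewards are assumptions, and the paper's exact training configuration is not stated here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    Outputs that beat the group average get a positive advantage and the rest
    get a negative one, so no separate value network is needed as a baseline.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every output receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for three paraphrases of the same input text.
    print(group_relative_advantages([0.1, 0.5, 0.9]))
```

When all outputs in a group receive the same reward, every advantage is zero and that group contributes no learning signal.
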