Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion

📅 2025-08-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing AIGT detection evasion methods suffer from high computational overhead and degraded text quality. To address this, we propose the Self-Disguise Attack (SDA), the first framework enabling large language models to autonomously generate human-like text with high detection resistance. SDA jointly optimizes adversarial feature extraction and retrieval-augmented in-context examples, integrated with controllable prompt engineering, to implicitly steer the model away from detector-sensitive features, without external fine-tuning or post-processing. Evaluated on three mainstream LLM-generated text corpora, SDA reduces the average detection accuracy of six state-of-the-art AIGT detectors by up to 42.7%, while preserving linguistic fluency, semantic fidelity, and lexical diversity. This work establishes a new benchmark for evaluating AIGT detection robustness and introduces a practical, lightweight adversarial paradigm grounded in prompt-based self-disguise.

📝 Abstract
AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text quality degradation. To address these challenges, we propose the Self-Disguise Attack (SDA), a novel approach that enables Large Language Models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. SDA comprises two main components: an adversarial feature extractor and a retrieval-based context example optimizer. The former generates disguise features that enable the LLM to understand how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing the self-disguise ability of the LLM and mitigating the impact of the disguise process on the diversity of the generated text. SDA directly employs prompts containing the disguise features and optimized context examples to guide the LLM in generating detection-resistant text, thereby reducing resource consumption. Experimental results demonstrate that SDA effectively reduces the average detection accuracy of various AIGT detectors across texts generated by three different LLMs, while maintaining the quality of the AIGT.
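
To picture the pipeline the abstract describes, the sketch below assembles a single prompt from disguise features and retrieved in-context examples, leaving the model call to any standard completions API. This is one reading of the method, not the authors' implementation: `Example`, `build_sda_prompt`, and the sample features are hypothetical names, and the real disguise features would come from the adversarial feature extractor.

```python
# Minimal sketch of the SDA prompting pipeline as described in the abstract.
# All names here are illustrative stand-ins, not the paper's code; the
# disguise features and examples would be produced by SDA's two components.

from dataclasses import dataclass

@dataclass
class Example:
    task: str        # the writing task posed to a human
    human_text: str  # a human-written answer from the external knowledge base

def build_sda_prompt(task: str, disguise_features: list[str],
                     examples: list[Example]) -> str:
    """Pack disguise features and retrieved in-context examples into one
    prompt, so the LLM produces detection-resistant text in a single pass
    (no fine-tuning of the model, no post-hoc rewriting of its output)."""
    feature_block = "\n".join(f"- {f}" for f in disguise_features)
    example_block = "\n\n".join(
        f"Task: {ex.task}\nHuman answer: {ex.human_text}" for ex in examples
    )
    return (
        "Write so the text reads as human-authored. "
        "Adopt these stylistic features:\n"
        f"{feature_block}\n\n"
        "Human writing samples for reference:\n"
        f"{example_block}\n\n"
        f"Task: {task}\nAnswer:"
    )

if __name__ == "__main__":
    prompt = build_sda_prompt(
        task="Summarize the plot of Hamlet in one paragraph.",
        disguise_features=["vary sentence length", "use occasional hedging",
                           "avoid formulaic transitions"],
        examples=[Example("Summarize Macbeth.", "Macbeth, urged on by ...")],
    )
    print(prompt)  # feed this to any chat/completions API
```

The design point worth noting is that everything happens at the prompt level, which is why the abstract can claim low resource consumption: no model weights change and no second rewriting pass is needed.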
Problem

Research questions and friction points this paper is trying to address.

Evade AI-generated text detection with reduced computational costs
Maintain text quality while avoiding detection by classifiers
Enhance LLM self-disguise ability using optimized context examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Disguise Attack enables LLMs to actively evade detection
Adversarial feature extractor generates human-like disguise features
Retrieval-based optimizer enhances context with relevant examples (see the sketch after this list)
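
To make the retrieval step concrete, here is a minimal sketch under stated assumptions: the external knowledge base is a plain list of human-written texts, and a bag-of-words cosine similarity stands in for whatever text encoder the paper actually uses. `retrieve_examples` and the sample data are hypothetical.

```python
# Sketch of the retrieval-based example optimizer: score every entry in an
# external knowledge base against the current task and keep the top-k as
# in-context examples. Bag-of-words cosine is an assumed stand-in for a
# real text encoder; swap in any embedding model.

from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(cnt * b[tok] for tok, cnt in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_examples(task: str, knowledge_base: list[str], k: int = 3) -> list[str]:
    """Return the k knowledge-base texts most similar to the task."""
    query = Counter(task.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda text: cosine(query, Counter(text.lower().split())),
                    reverse=True)
    return ranked[:k]

print(retrieve_examples(
    "write a short essay about climate change",
    ["a human essay about climate policy", "a recipe for bread",
     "notes on climate change impacts", "a poem about autumn"],
    k=2))
```

Ranking by similarity to the current task keeps the in-context examples topically relevant, which is what lets the disguise prompt preserve diversity rather than pushing every output toward the same style.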
🔎 Similar Papers
No similar papers found.
Yinghan Zhou
China Agricultural University
Juan Wen
College of Information and Electrical Engineering, China Agricultural University
Wanli Peng
College of Information and Electrical Engineering, China Agricultural University
Zhengxian Wu
Tsinghua University
Computer Vision, Large Language Model
Ziwei Zhang
College of Information and Electrical Engineering, China Agricultural University
Yiming Xue
China Agricultural University
data hiding, signal processing