🤖 AI Summary
Existing MGT detectors lack robustness against linguistic style-transfer attacks, a growing concern given the misuse of generative AI. Method: We propose the first linguistics-guided, systematic stress-testing framework, leveraging Direct Preference Optimization (DPO) for controllable style transfer to steer LLM-generated text toward human-like expression, thereby exposing detectors' overreliance on superficial linguistic features. Our framework integrates multi-model detection (Mage, Radar, LLM-DetectAIve), fine-grained linguistic feature analysis, and an adversarial fine-tuning pipeline requiring only minimal samples. Contribution/Results: The framework significantly degrades state-of-the-art detector accuracy with few-shot perturbations. Crucially, our interpretable style-shift analysis provides a principled benchmark for diagnosing detection biases and empirically supports the development of semantics-aware, robust detection methods grounded in deep linguistic understanding.
📝 Abstract
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about potential malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) against linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic cues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment, as well as the features detectors rely on to flag text as machine-generated. Our results show that detectors can be easily fooled with relatively few examples, causing a significant drop in detection performance. This highlights the need for detection methods that remain robust to unseen in-domain texts.
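The DPO alignment described above pairs, for each prompt, a preferred completion (here, human-written text) with a rejected one (machine-generated text) and optimizes the policy to widen the log-probability margin between them relative to a frozen reference model. As a minimal, self-contained sketch of the standard DPO objective (not the authors' exact training setup; the function name and toy log-probabilities are illustrative only):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the difference of policy-vs-reference log-ratios for the chosen (HWT)
    and rejected (MGT) completions."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) == softplus(-x); written this way for numerical clarity
    return math.log1p(math.exp(-logits))

# Toy example: when the policy already prefers the human-written completion
# more strongly than the reference does, the loss is below log(2).
loss_neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)        # margin 0 -> log(2)
loss_aligned = dpo_loss(-1.0, -5.0, -2.0, -3.0)    # positive margin -> lower loss
```

In a full pipeline these scalars would be summed token-level log-probabilities from the fine-tuned and reference LLMs; libraries such as Hugging Face TRL provide this objective off the shelf.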