🤖 AI Summary
This study addresses the vulnerability of AI text detectors to character-level manipulations. We propose and empirically validate the first systematic black-box adversarial attack leveraging Unicode homoglyphs—visually identical yet semantically distinct characters. By applying fine-grained, semantics-preserving homoglyph substitutions to AI-generated text, our method degrades detector performance without compromising readability or meaning, reducing the Matthews Correlation Coefficient (MCC) from 0.64 to −0.01. The attack achieves near-total evasion across seven state-of-the-art detectors—including OpenAI’s official classifier and watermark-based detectors—and five diverse, cross-domain datasets, with average MCC approaching zero. Rigorous robustness evaluation and attribution of each detector's internal behavior confirm strong generalization across models and datasets. Our work exposes a fundamental flaw in current detection paradigms: overreliance on superficial character-level signals. It provides critical insights and empirical evidence for developing more robust, semantics-aware AI content authentication mechanisms.
📝 Abstract
The advent of Large Language Models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, substantial research has been conducted with the objective of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks (Latin A $\rightarrow$ Cyrillic A) as a means of circumventing existing detectors. We conduct a comprehensive evaluation to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI's detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). Through further examination, we identify the technical reasons underlying the attacks' success, which vary across detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
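The core idea of a homoglyph substitution can be sketched in a few lines. The mapping and substitution policy below are illustrative assumptions, not the paper's actual attack configuration; they simply show how swapping Latin characters for visually identical Cyrillic code points yields text that renders the same but tokenizes differently:

```python
# Minimal sketch of a homoglyph-based substitution (illustrative only).
# The Latin->Cyrillic pairs below are a small subset of known homoglyphs;
# the paper's actual substitution strategy and coverage may differ.
HOMOGLYPHS = {
    "A": "\u0410",  # Cyrillic Capital Letter A
    "a": "\u0430",  # Cyrillic Small Letter A
    "e": "\u0435",  # Cyrillic Small Letter Ie
    "o": "\u043e",  # Cyrillic Small Letter O
    "c": "\u0441",  # Cyrillic Small Letter Es
}

def apply_homoglyphs(text: str) -> str:
    """Replace every mapped Latin character with its Cyrillic look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "AI-generated content"
attacked = apply_homoglyphs(original)
print(attacked)              # renders identically on screen...
print(original == attacked)  # ...but is a different byte sequence: False
```

Because detectors and watermark verifiers operate on token or character sequences rather than rendered glyphs, even a few such substitutions can shift the statistics they rely on while leaving the text visually unchanged for human readers.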