TH-Bench: Evaluating Evading Attacks via Humanizing AI Text on Machine-Generated Text Detectors

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluations of machine-generated text (MGT) detectors lack a standardized benchmark and a multidimensional, joint analysis, particularly of their vulnerability to humanization-based evading attacks. Method: We introduce TH-Bench, the first comprehensive benchmark for evaluating evading attacks against MGT detectors, incorporating six state-of-the-art attacks, 13 detectors, and six datasets spanning 19 domains, with text generated by 11 mainstream LLMs. We propose a three-dimensional evaluation framework that jointly measures evading effectiveness, text quality, and computational overhead. Contribution/Results: The framework uncovers intrinsic trade-offs among these dimensions, revealing that no single attack dominates on all three. We further identify two optimization insights and validate their correctness and effectiveness in preliminary experiments, suggesting directions for future work on detector resilience.

📝 Abstract
As Large Language Models (LLMs) advance, Machine-Generated Texts (MGTs) have become increasingly fluent, high-quality, and informative. A wide range of existing MGT detectors are designed to identify MGTs and prevent the spread of plagiarism and misinformation. However, adversaries attempt to humanize MGTs to evade detection (termed evading attacks), which often requires only minor modifications to bypass MGT detectors. Unfortunately, existing attacks lack a unified and comprehensive evaluation framework, as they are assessed using different experimental settings, model architectures, and datasets. To fill this gap, we introduce the Text-Humanization Benchmark (TH-Bench), the first comprehensive benchmark for evaluating evading attacks against MGT detectors. TH-Bench evaluates attacks along three key dimensions: evading effectiveness, text quality, and computational overhead. Our extensive experiments evaluate 6 state-of-the-art attacks against 13 MGT detectors on 6 datasets, spanning 19 domains and generated by 11 widely used LLMs. Our findings reveal that no single evading attack excels across all three dimensions. Through in-depth analysis, we highlight the strengths and limitations of different attacks. More importantly, we identify a trade-off among the three dimensions and propose two optimization insights. Through preliminary experiments, we validate their correctness and effectiveness, offering potential directions for future research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating evading attacks on machine-generated text detectors
Introducing TH-Bench for comprehensive attack assessment
Analyzing trade-offs in evading effectiveness, text quality, and computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

TH-Bench provides the first comprehensive evaluation of evading attacks against MGT detectors.
Assesses attacks along three dimensions: evading effectiveness, text quality, and computational overhead.
Identifies a trade-off among the three dimensions and proposes two optimization insights.
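The "no single attack excels across all three dimensions" finding is a Pareto-dominance claim: each attack wins on a different axis, so none dominates the others. A minimal sketch of that kind of three-dimensional comparison is below; the attack names and numbers are made up for illustration and are not TH-Bench's actual results or API.

```python
from dataclasses import dataclass

# Hypothetical per-attack measurements along TH-Bench's three dimensions.
# Higher evading effectiveness and text quality are better; lower overhead is better.
@dataclass
class AttackResult:
    name: str
    evasion_rate: float   # fraction of detector verdicts flipped
    text_quality: float   # normalized quality score in [0, 1]
    overhead_s: float     # seconds of compute per text

def dominates(a: AttackResult, b: AttackResult) -> bool:
    """True if `a` is at least as good as `b` on every dimension
    and strictly better on at least one."""
    at_least_as_good = (a.evasion_rate >= b.evasion_rate and
                        a.text_quality >= b.text_quality and
                        a.overhead_s <= b.overhead_s)
    strictly_better = (a.evasion_rate > b.evasion_rate or
                       a.text_quality > b.text_quality or
                       a.overhead_s < b.overhead_s)
    return at_least_as_good and strictly_better

def pareto_front(results):
    """Attacks not dominated by any other attack."""
    return [r for r in results
            if not any(dominates(o, r) for o in results if o is not r)]

# Illustrative (fabricated) numbers: each attack leads on a different axis,
# mirroring the paper's trade-off finding.
results = [
    AttackResult("paraphrase",     0.82, 0.70, 9.5),  # best evasion, slow
    AttackResult("word-swap",      0.55, 0.90, 0.3),  # best quality/speed
    AttackResult("prompt-rewrite", 0.74, 0.85, 4.1),  # balanced
]
front = pareto_front(results)
print([r.name for r in front])  # all three survive: no attack dominates
```

If any attack dominated the others, the Pareto front would collapse to that single attack; here every attack sits on the front, which is the structural situation the benchmark reports.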