🤖 AI Summary
Existing AIG text detectors exhibit insufficient robustness in real-world scenarios, particularly against texts generated via one-shot/few-shot prompting or domain-specific continued pre-training (CPT) of large language models. Method: We introduce DACTYL, a challenging benchmark designed to evaluate detectors against these realistic adversarial conditions, systematically incorporating one-shot, few-shot, and CPT-generated texts. For the CPT generators, we train all model parameters using a memory-efficient optimization approach, and we train our own classifiers under two paradigms: standard binary cross-entropy (BCE) optimization and deep X-risk optimization (DXO). Contribution/Results: Experiments reveal severe performance degradation of mainstream detectors on DACTYL. While BCE-trained classifiers marginally lead on the in-distribution test set, DXO classifiers generalize better out of distribution: in a mock student essay detection deployment, the best DXO classifier outscores the best BCE-trained classifier by 50.56 macro-F1 points at the lowest false positive rates, demonstrating superior generalization and practical utility in authentic deployment settings.
📝 Abstract
Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. To address this, we rigorously examine the machine-learning procedures used to build these detectors. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on one-shot or few-shot generations, where LLMs are given human texts as examples. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focused on one-shot/few-shot generations. We also include texts from domain-specific continually pre-trained (CPT) language models, for which we fully train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO classifiers on the DACTYL test set, the latter excel on out-of-distribution (OOD) texts. In a mock deployment scenario for student essay detection with an OOD student essay dataset, the best DXO classifier outscored the best BCE-trained classifier by 50.56 macro-F1 points at the lowest false positive rate for each. Our results indicate that DXO classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.
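To make the contrast between the two training objectives concrete, here is a minimal sketch. BCE scores each example independently, while X-risk objectives (the family DXO optimizes) couple examples, e.g. through pairwise AUC-style surrogates that penalize any human-written (negative) text scoring close to an AI-generated (positive) one. The function names, the squared-hinge surrogate, and the margin value are illustrative assumptions, not the paper's actual implementation.

```python
import math

def bce_loss(scores, labels):
    """Standard binary cross-entropy on sigmoid-activated logits.
    Each example contributes independently of the others."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(scores)

def pairwise_xrisk_loss(scores, labels, margin=1.0):
    """Toy X-risk-style surrogate (squared hinge over pos/neg pairs):
    penalizes every case where an AI-generated (label 1) text does not
    outscore a human-written (label 0) text by at least `margin`.
    Losses defined over pairs like this are one family the DXO
    framework is built to optimize."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(max(0.0, margin - (p - n)) ** 2 for p, n in pairs) / len(pairs)

# Well-separated batch: positives clear the margin over every negative,
# so the pairwise surrogate is zero while BCE is small but nonzero.
scores = [2.0, 1.5, -1.0, -2.0]
labels = [1, 1, 0, 0]
print(bce_loss(scores, labels))
print(pairwise_xrisk_loss(scores, labels))  # 0.0
```

The design point this illustrates: because the pairwise objective depends on the relative ordering of positives and negatives rather than on per-example calibrated probabilities, it directly targets ranking quality, which is one intuition for why such classifiers can hold up better at low false positive operating points.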