🤖 AI Summary
Current machine-generated text detection research is heavily skewed toward English, leaving Central and Eastern European (CEE) languages without systematic methodologies and benchmarks. Method: We introduce CE-MGTB, the first multi-domain, multi-generator, multilingual, and adversarially robust detection benchmark for CEE languages, and propose a robust supervised fine-tuning framework based on multilingual pre-trained models. We systematically evaluate cross-lingual transfer and demonstrate the superiority of native-language fine-tuning. Contribution/Results: Experiments across diverse generators, domains, and obfuscation-based adversarial attacks show that fine-tuning solely on target CEE-language data achieves both the best detection accuracy and the strongest robustness, significantly outperforming cross-lingual transfer baselines. CE-MGTB fills a critical research gap in AI-generated content (AIGC) detection for CEE languages and provides a reproducible benchmark and an effective paradigm for non-English AIGC detection.
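To make the fine-tuning setup concrete, the sketch below fine-tunes a multilingual pre-trained encoder for binary human-vs-machine classification on native-language data. The model choice (`xlm-roberta-base`), hyperparameters, and toy Czech examples are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of supervised fine-tuning a multilingual encoder for
# binary machine-generated text detection. Model name, hyperparameters,
# and the toy dataset are assumptions, not the benchmark's actual setup.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"  # any multilingual pre-trained encoder works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2)  # label 0 = human-written, 1 = machine-generated

class DetectionDataset(torch.utils.data.Dataset):
    """(text, label) pairs for the target Central European language."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy native-language (Czech) training data; a real run would load the
# benchmark's train split for the target language instead.
train_ds = DetectionDataset(
    ["Dobrý den, posílám zápis ze včerejší schůze.",
     "Tento text byl vygenerován jazykovým modelem."],
    [0, 1])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=train_ds,
)
trainer.train()
```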
📝 Abstract
Machine-generated text detection, an important task, is predominantly studied for English. This leaves existing detectors almost unusable for non-English languages, forcing them to rely purely on cross-lingual transferability. Only a few works focus on any of the Central European languages, leaving transferability to these languages largely unexplored. We fill this gap by providing the first benchmark of detection methods focused on this region, while also comparing training-language combinations to identify the best-performing ones. We focus on multi-domain, multi-generator, and multilingual evaluation, pinpointing the effects of these individual aspects as well as the adversarial robustness of detection methods. Detectors fine-tuned with supervision on the Central European languages prove the most performant in these languages and the most resistant to obfuscation.
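To illustrate the kind of obfuscation such robustness evaluations probe, one common form is homoglyph substitution, which perturbs the detector's input while remaining invisible to a human reader. The character map and substitution rate below are generic illustrative assumptions, not the paper's actual attack suite.

```python
# Minimal sketch of a homoglyph obfuscation attack of the kind used to
# probe detector robustness. The character map and substitution rate are
# illustrative assumptions, not the benchmark's attack configuration.
import random

# Latin characters mapped to visually near-identical Cyrillic homoglyphs.
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с", "p": "р", "x": "х"}

def obfuscate(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Replace a fraction of replaceable characters with homoglyphs,
    leaving the text visually unchanged for a human reader."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, ch in enumerate(chars) if ch in HOMOGLYPHS]
    for i in rng.sample(candidates, k=int(len(candidates) * rate)):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)

original = "Tento text byl vygenerován jazykovým modelem."
attacked = obfuscate(original)
print(attacked)               # looks identical on screen...
print(original == attacked)   # ...but False: the token stream has changed
```

A detector that is robust in the sense measured here should assign nearly the same machine-generated probability to `original` and `attacked`.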