🤖 AI Summary
Current machine-generated text detection research is heavily skewed toward English, leaving Central and Eastern European (CEE) languages without systematic methodologies and benchmarks. Method: We introduce CE-MGTB, the first multi-domain, multi-generator, multilingual, and adversarially robust detection benchmark for CEE languages, and propose a robust supervised fine-tuning framework based on multilingual pre-trained models. We systematically evaluate cross-lingual transfer and demonstrate the superiority of native-language fine-tuning. Contribution/Results: Experiments across diverse generators, domains, and obfuscation-based adversarial attacks show that fine-tuning solely on target CEE-language data achieves both the best detection accuracy and the strongest robustness, significantly outperforming cross-lingual transfer baselines. CE-MGTB fills a critical research gap in AI-generated content (AIGC) detection for CEE languages and provides a reproducible benchmark and an effective paradigm for non-English AIGC detection.
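To make the fine-tuning setup concrete, the sketch below fine-tunes a multilingual pre-trained encoder for binary human-vs-machine classification on native-language data. The model choice (`xlm-roberta-base`), hyperparameters, and toy Czech examples are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of supervised fine-tuning a multilingual encoder for
# binary machine-generated text detection. Model name, hyperparameters,
# and the toy dataset are assumptions, not the benchmark's actual setup.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"  # any multilingual pre-trained encoder works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2)  # label 0 = human-written, 1 = machine-generated

class DetectionDataset(torch.utils.data.Dataset):
    """(text, label) pairs for the target Central European language."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=256)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy native-language (Czech) training data; a real run would load the
# benchmark's train split for the target language instead.
train_ds = DetectionDataset(
    ["Dobrý den, posílám zápis ze včerejší schůze.",
     "Tento text byl vygenerován jazykovým modelem."],
    [0, 1])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=train_ds,
)
trainer.train()
```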
📝 Abstract
Machine-generated text detection, an important task, is predominantly studied for English. This leaves existing detectors almost unusable for non-English languages, forcing them to rely purely on cross-lingual transferability. Only a few works focus on any of the Central European languages, leaving transferability to these languages largely unexplored. We fill this gap by providing the first benchmark of detection methods focused on this region, while also comparing training-language combinations to identify the best-performing ones. We focus on multi-domain, multi-generator, and multilingual evaluation, pinpointing the effects of these individual aspects as well as the adversarial robustness of detection methods. Detectors fine-tuned with supervision on the Central European languages prove the most performant in these languages and the most resistant to obfuscation.
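To illustrate the kind of obfuscation such robustness evaluations probe, one common form is homoglyph substitution, which perturbs the detector's input while remaining invisible to a human reader. The character map and substitution rate below are generic illustrative assumptions, not the paper's actual attack suite.

```python
# Minimal sketch of a homoglyph obfuscation attack of the kind used to
# probe detector robustness. The character map and substitution rate are
# illustrative assumptions, not the benchmark's attack configuration.
import random

# Latin characters mapped to visually near-identical Cyrillic homoglyphs.
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с", "p": "р", "x": "х"}

def obfuscate(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Replace a fraction of replaceable characters with homoglyphs,
    leaving the text visually unchanged for a human reader."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, ch in enumerate(chars) if ch in HOMOGLYPHS]
    for i in rng.sample(candidates, k=int(len(candidates) * rate)):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)

original = "Tento text byl vygenerován jazykovým modelem."
attacked = obfuscate(original)
print(attacked)               # looks identical on screen...
print(original == attacked)   # ...but False: the token stream has changed
```

A detector that is robust in the sense measured here should assign nearly the same machine-generated probability to `original` and `attacked`.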