AI Summary
Current methods for detecting text generated by large language models (LLMs) lack reliable evaluation benchmarks in multilingual and real-world settings. To address this gap, this work proposes DetectRL-X, the first multilingual, multidimensional detection benchmark tailored for real-world applications. DetectRL-X encompasses eight major languages and six high-risk domains, integrating texts produced by commercial LLMs, common AI-assisted editing operations (e.g., polishing, expansion, compression), and multilingual paraphrasing and perturbation-based attack strategies. The benchmark enables fine-grained analysis and stress testing of detectors, systematically revealing performance disparities of existing methods across languages, domains, generators, and adversarial conditions. Empirical evaluations demonstrate that DetectRL-X provides critical infrastructure for advancing multilingual LLM-generated text detection.
Abstract
The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.
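The per-language and per-domain analysis described above can be illustrated with a minimal sketch. It is not the benchmark's own evaluation code: the record fields (`lang`, `domain`, `label`, `score`), the language codes, and the toy scores are all hypothetical, and AUROC is assumed here as the metric only because it is a common choice for detector evaluation.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the fraction of (LLM, human) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # LLM-generated
    neg = [s for y, s in zip(labels, scores) if y == 0]  # human-written
    if not pos or not neg:
        return None  # undefined when a slice contains only one class
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auroc_by_slice(records, key):
    """Group records by a metadata field (hypothetical schema) and score each group."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = groups[r[key]]
        labels.append(r["label"])
        scores.append(r["score"])
    return {k: auroc(lbls, scs) for k, (lbls, scs) in groups.items()}


# Toy records: label 1 = LLM-generated, 0 = human-written; score = detector output.
records = [
    {"lang": "en", "domain": "news", "label": 1, "score": 0.9},
    {"lang": "en", "domain": "news", "label": 0, "score": 0.2},
    {"lang": "en", "domain": "review", "label": 1, "score": 0.7},
    {"lang": "en", "domain": "review", "label": 0, "score": 0.4},
    {"lang": "de", "domain": "news", "label": 1, "score": 0.4},
    {"lang": "de", "domain": "news", "label": 0, "score": 0.6},
    {"lang": "de", "domain": "review", "label": 1, "score": 0.8},
    {"lang": "de", "domain": "review", "label": 0, "score": 0.3},
]

per_language = auroc_by_slice(records, "lang")
per_domain = auroc_by_slice(records, "domain")
```

Passing a different key (for example a generator or attack-type field, if the data carried one) would slice the same scores along another dimension. The pairwise AUROC is quadratic in slice size, so a rank-based implementation would be preferable for large test sets.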