🤖 AI Summary
Wikipedia faces a growing threat from low-quality text generated by large language models (LLMs), yet existing machine-generated text (MGT) detectors are evaluated predominantly on generic benchmarks and generalise poorly to real-world editing tasks such as paragraph writing, summarisation, and text style transfer. Method: We introduce WETBench, a multilingual, multi-generator MGT detection benchmark designed specifically for Wikipedia editing tasks, built on two new datasets that span three languages and cover the three editing scenarios above. Contribution/Results: Empirical evaluation shows that training-based detectors reach only 78% average accuracy, while zero-shot detectors drop to 58%, well below their performance on generic benchmarks. These results underscore the need for task-specific evaluation and establish WETBench as a step toward editor-centric, standardised MGT detection assessment.
📝 Abstract
Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
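The evaluation protocol described above (benchmark diverse detectors on labelled human vs. machine-generated text, reporting per-task and average accuracy) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `evaluate_detector` helper, the toy marker-phrase detector, and the sample data are all hypothetical.

```python
from collections import defaultdict

def evaluate_detector(detector, samples):
    """Compute per-task and average accuracy for a binary MGT detector.

    `samples` is a list of (task, text, label) tuples, where label is
    1 for machine-generated text and 0 for human-written text.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, text, label in samples:
        total[task] += 1
        if detector(text) == label:
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    average = sum(per_task.values()) / len(per_task)
    return per_task, average

# Toy detector (purely illustrative): flags text containing a marker phrase.
toy_detector = lambda text: 1 if "as an AI" in text else 0

# Hypothetical labelled samples for two of the three editing tasks.
samples = [
    ("paragraph_writing", "The city was founded in 1204.", 0),
    ("paragraph_writing", "as an AI, I note the city grew quickly.", 1),
    ("summarisation", "A brief human-written summary of the section.", 0),
    ("summarisation", "as an AI I can condense this section for you.", 1),
]

per_task, average = evaluate_detector(toy_detector, samples)
```

In the actual benchmark, `detector` would be a fine-tuned classifier or a zero-shot scoring method, and `samples` would come from the task-specific datasets; the averaging over tasks and settings mirrors how the 78% and 58% figures are reported.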