WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Wikipedia faces a growing threat from low-quality text generated by large language models (LLMs), yet existing machine-generated text (MGT) detectors are evaluated predominantly on generic benchmarks and generalise poorly to real-world editing tasks such as paragraph writing, summarisation, and text style transfer. Method: We introduce WETBench, a multilingual, multi-generator MGT detection benchmark designed specifically for Wikipedia editing tasks, built on two new datasets that implement the three aforementioned editing scenarios across three languages. Contribution/Results: Empirical evaluation reveals that training-based detectors achieve only 78% average accuracy, while zero-shot detectors drop to 58%, well below their reported performance on generic benchmarks. These results underscore the critical need for task-specific evaluation and establish WETBench as a foundational step toward editor-centric, standardised MGT detection assessment.

📝 Abstract
Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
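
The evaluation protocol the abstract describes (per-task, per-language MGT generation with the best-performing prompt, then benchmarking detectors) reduces to an accuracy loop over task/language/generator settings. Below is a minimal sketch of that loop; `load_samples`, `detect`, and the task and language lists are hypothetical placeholders, not the paper's actual data format or detector API.

```python
# Minimal sketch of a WETBench-style evaluation loop. All names here
# (load_samples, detect, the language codes) are illustrative assumptions;
# the real benchmark's datasets and detectors are not reproduced.

from statistics import mean

TASKS = ["paragraph_writing", "summarisation", "text_style_transfer"]
LANGUAGES = ["lang1", "lang2", "lang3"]  # placeholders for the three languages


def detect(text: str) -> int:
    """Hypothetical detector: returns 1 for machine-generated, 0 for human.
    A toy lexical heuristic stands in for a real trained or zero-shot model."""
    return int("as an ai" in text.lower())


def load_samples(task: str, lang: str) -> list[tuple[str, int]]:
    """Hypothetical loader yielding (text, label) pairs; label 1 = MGT."""
    return [
        ("A human-written paragraph about glaciers.", 0),
        ("As an AI language model, here is a summary.", 1),
    ]


accuracies = []
for task in TASKS:
    for lang in LANGUAGES:
        samples = load_samples(task, lang)
        correct = sum(detect(text) == label for text, label in samples)
        acc = correct / len(samples)
        accuracies.append(acc)
        print(f"{task}/{lang}: accuracy={acc:.2f}")

# The paper's headline numbers (78% training-based, 58% zero-shot) are
# averages of per-setting accuracies like these.
print(f"average accuracy across settings: {mean(accuracies):.2f}")
```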
Problem

Research questions and friction points this paper is trying to address.

Detecting machine-generated text in Wikipedia editing tasks
Evaluating MGT detectors on task-specific, editor-driven scenarios
Assessing detector reliability across multilingual, multi-generator contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual, multi-generator, task-specific benchmark (WETBench)
Three empirically grounded editing tasks built on two new datasets across three languages
Systematic evaluation of training-based and zero-shot detectors