MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

📅 2024-06-18
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Existing research on machine-generated text detection primarily targets long-form English texts and struggles with the linguistic irregularities prevalent in short social-media posts, such as informal language, grammatical errors, emojis, and hashtags; multilingual benchmarks are also lacking. Method: We introduce the first social-media-oriented, multilingual (22 languages), multi-platform (5 platforms) benchmark for short-text detection, comprising about 472K human-written and LLM-generated (from 7 models) short posts. Contribution/Results: This work fills critical gaps in short-text, multilingual, and informal-language detection and empirically shows that platform choice significantly affects detector performance. Zero-shot transfer and fine-tuning experiments demonstrate strong cross-lingual and cross-platform generalization: fine-tuning yields substantial gains, while platform-specific adaptation proves essential for robustness, underscoring the decisive role of domain alignment in real-world deployment.

📝 Abstract
Recent LLMs can generate high-quality multilingual texts that are indistinguishable to humans from authentic human-written ones. Research in machine-generated text detection is, however, mostly focused on the English language and longer texts, such as news articles, scientific papers, or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods to detect such texts, reflected also in the lack of multilingual benchmark datasets. To fill this gap, we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that fine-tuned detectors can be trained on social-media texts without difficulty and that the platform selection used for training matters.
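The dataset composition stated in the abstract can be sanity-checked with simple arithmetic (a minimal sketch; the per-source figure below is derived here, not a number reported in the paper):

```python
# Sanity-check the MultiSocial dataset composition from the abstract:
# 472,097 texts total, split roughly evenly across human-written posts
# and posts generated by each of 7 multilingual LLMs (8 sources in all).
# The ~59K per-source value is derived arithmetic, not stated in the paper.
total_texts = 472_097
num_llms = 7
num_sources = num_llms + 1  # 7 LLM generators + 1 human-written source

per_source = total_texts / num_sources
print(f"~{per_source:,.0f} texts per source")  # consistent with "about 58k"
```

Dividing by eight sources gives roughly 59K texts each, which agrees with the abstract's "about 58k" human-written texts and "approximately the same amount" per model.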
Problem

Research questions and friction points this paper is trying to address.

Detecting machine-generated multilingual social-media texts is challenging
Existing methods lack focus on short informal social-media content
No multilingual benchmark datasets exist for social-media text detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual dataset for social-media text detection
Benchmarking 7 multilingual LLMs on 22 languages
Fine-tuned detectors effective on social-media texts