🤖 AI Summary
The growing risk of large language models (LLMs) amplifying misinformation remains poorly quantified across multilingual, real-world contexts. Method: We propose a reproducible detection and attribution framework integrating zero-shot classification, language-specific watermark analysis, temporal comparative statistics, and cross-platform metadata alignment, applied to the first real-world multilingual misinformation dataset. Results: Empirical analysis reveals that LLM-generated content constituted an average of 37% of mainstream-language misinformation in 2023–2024, rising to 62% in select low-resource languages and on encrypted platforms; we further identify pronounced platform migration and linguistic asymmetry in diffusion patterns. This work bridges the academic divide between alarmist "threat exaggeration" and neglect of "long-tail risks," establishing the first cross-lingual empirical benchmark and methodological foundation for governing multilingual LLM-generated content.
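To make the zero-shot detection step concrete, below is a minimal sketch of one common zero-shot approach to flagging machine-generated text: scoring a passage's perplexity under a reference language model and thresholding it. The model choice (gpt2), the cutoff value, and the helper names are illustrative assumptions for this sketch, not the authors' actual classifier or calibration.

```python
# Minimal sketch of zero-shot machine-generated-text detection via
# perplexity under a reference LM. The model (gpt2) and the threshold
# are illustrative assumptions, not the paper's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the reference LM's perplexity on `text`."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy loss; exp(loss) is perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

THRESHOLD = 20.0  # hypothetical cutoff; a real system calibrates per language

def looks_machine_generated(text: str) -> bool:
    # LLM output tends to score lower perplexity under another LM
    # than organic human writing does.
    return perplexity(text) < THRESHOLD
```

Lower perplexity under the reference model only loosely signals LLM-like text, and any fixed cutoff would need per-language calibration in a multilingual setting; this is why a detection signal like the one sketched here is typically combined with complementary evidence such as watermark analysis and metadata alignment.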
📝 Abstract
The increasing sophistication of large language models (LLMs) and the resulting quality of generated multilingual text raise concerns about potential misuse for disinformation. While humans struggle to distinguish LLM-generated content from human-written text, the scholarly debate about the impact of such content remains divided. Some argue that heightened fears are overblown due to natural ecosystem limitations, while others contend that specific "long-tail" contexts face overlooked risks. Our study bridges this debate by providing the first empirical evidence of LLM presence in the latest real-world disinformation datasets, documenting the increase of machine-generated content following ChatGPT's release, and revealing crucial patterns across languages, platforms, and time periods.