ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection

Date: 2025-10-03
AI Summary
This work addresses two challenges in detecting Arabic text generated by large language models (LLMs): poor cross-genre generalization and the absence of high-quality benchmark datasets. To this end, the authors introduce ALHD, the first large-scale, multi-genre (news, social media, reviews), multi-variety (Modern Standard Arabic and dialectal Arabic) dataset for distinguishing human-written from LLM-generated Arabic text. ALHD draws on multiple sources, balances class and genre distributions, and provides standardized train/validation/test splits. Experiments compare traditional classifiers, BERT-based models, and LLM-based zero-shot and few-shot approaches. Fine-tuned BERT models perform competitively and outperform the LLM-based detectors, yet performance degrades in cross-genre settings, with news texts proving the hardest to detect. The work establishes a reproducible benchmark for Arabic LLM-generated text detection, offers empirical insight into genre-specific detection difficulty, and points to directions for future research.
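The balanced, standardized splits mentioned above can be illustrated with a minimal sketch. It assumes each record is a dict with hypothetical `genre` and `label` keys and uses an illustrative 80/10/10 ratio; neither the field names nor the ratio are taken from ALHD itself.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/val/test while preserving the share of
    every (genre, label) group in each split.

    `records` is a list of dicts with hypothetical keys "genre" and "label".
    The ratios are illustrative, not the ones used by the dataset authors.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group records by (genre, label) so each group is split separately.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["genre"], rec["label"])].append(rec)

    splits = {"train": [], "val": [], "test": []}
    for key in sorted(groups):  # sorted for a deterministic group order
        items = groups[key][:]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        splits["train"].extend(items[:n_train])
        splits["val"].extend(items[n_train:n_train + n_val])
        splits["test"].extend(items[n_train + n_val:])  # remainder
    return splits
```

Splitting inside each (genre, label) group, rather than over the whole pool, is what keeps every split balanced across both class and genre.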

Abstract
We introduce ALHD, the first large-scale comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covers both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and drawn from multiple human sources, which enables the study of generalizability in Arabic LLM-generated text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze, and discuss benchmark experiments on the new dataset, identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are, however, not always consistent: models struggle to generalize when they face unseen patterns in cross-genre settings. These challenges are most prominent for news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research on Arabic LLM-generated text detection and on mitigating the risks of misinformation, academic dishonesty, and cyber threats.
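The cross-genre setting described in the abstract (train on one genre, test on the others) can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' protocol: the `splits` layout, the label names `"human"` and `"llm"`, and the toy length-threshold detector are all assumptions standing in for a real classifier.

```python
def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def cross_genre_matrix(splits, fit, predict):
    """Train on each genre and evaluate on every genre.

    `splits` is assumed to look like
    {genre: {"train": [(text, label), ...], "test": [(text, label), ...]}}.
    Returns {(train_genre, test_genre): accuracy}; the diagonal entries
    are in-genre scores, the rest are cross-genre scores.
    """
    matrix = {}
    for src, src_split in splits.items():
        model = fit(src_split["train"])
        for tgt, tgt_split in splits.items():
            texts = [t for t, _ in tgt_split["test"]]
            golds = [y for _, y in tgt_split["test"]]
            matrix[(src, tgt)] = accuracy(predict(model, texts), golds)
    return matrix


# Placeholder detector (hypothetical): thresholds on token count.
# Any real model with the same fit/predict shape can be swapped in.
def fit_length_threshold(train):
    means = {}
    for label in ("human", "llm"):
        lengths = [len(t.split()) for t, y in train if y == label]
        means[label] = sum(lengths) / len(lengths)
    threshold = (means["human"] + means["llm"]) / 2
    longer = "llm" if means["llm"] >= means["human"] else "human"
    return threshold, longer


def predict_length_threshold(model, texts):
    threshold, longer = model
    shorter = "human" if longer == "llm" else "llm"
    return [longer if len(t.split()) > threshold else shorter for t in texts]
```

Comparing each off-diagonal entry with the diagonal entry for the same training genre gives a direct measure of how much a detector loses when it moves to an unseen genre.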
Problem

Research questions and friction points this paper is trying to address.

Detecting LLM-generated Arabic texts across multiple genres and dialects
Addressing generalization challenges in cross-genre Arabic text detection
Mitigating misinformation risks from Arabic LLM-generated content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Arabic dataset for LLM detection
Multi-genre coverage with MSA and dialects
Fine-tuned BERT models outperform LLM-based detection
Ali Khairallah
School of Electronic Engineering and Computer Science, Queen Mary University of London, London, United Kingdom
Arkaitz Zubiaga
Queen Mary University of London
Social Media Mining, Social Data Science, Natural Language Processing, Computational Social Science, Computational Journalism