ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection

📅 2025-10-03

🤖 AI Summary

This work addresses two challenges in detecting Arabic text generated by large language models (LLMs): poor cross-genre generalization and the absence of high-quality benchmark datasets. To this end, the authors introduce ALHD, the first large-scale, multi-genre (news, social media, reviews), multi-variety (Modern Standard Arabic and dialectal Arabic) dataset for distinguishing human-written from LLM-generated Arabic text. ALHD draws on multiple human sources and three LLMs, keeps classes balanced, and ships with standardized balanced splits. Benchmark experiments compare traditional classifiers, BERT-based models, and LLMs in zero-shot and few-shot settings; fine-tuned BERT models achieve competitive performance and outperform the LLM-based approaches. Performance is less consistent in cross-genre settings, and news articles are the hardest genre to detect because LLM-generated news closely resembles human writing in style. The work establishes a reproducible benchmark for Arabic LLM-generated text detection and points to gaps for future research.
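The balanced splits mentioned above can be illustrated with a minimal sketch. It assumes each sample is a dictionary with hypothetical `genre`, `label`, and `text` fields and uses an 80/10/10 ratio chosen only for illustration; the actual ALHD schema and split ratios are not given in this summary.

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split samples into train/val/test while preserving the
    proportion of every (genre, label) group in each split.
    Field names and ratios are illustrative assumptions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["genre"], sample["label"])].append(sample)

    splits = {"train": [], "val": [], "test": []}
    for key in sorted(groups):  # sorted for a deterministic order
        items = groups[key]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to test
    return splits

# Toy data: 10 placeholder samples per (genre, label) group.
genres = ["news", "social_media", "reviews"]
labels = ["human", "llm"]
samples = [
    {"genre": g, "label": l, "text": f"{g}-{l}-{i}"}
    for g in genres for l in labels for i in range(10)
]
splits = stratified_split(samples)
```

Splitting within each (genre, label) group, rather than over the pooled data, keeps every split balanced along both axes at once.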

📝 Abstract

We introduce ALHD, the first large-scale, comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covers both MSA and dialectal Arabic, and contains over 400K balanced samples, with the LLM-generated texts produced by three leading LLMs and the human texts drawn from multiple sources. This design enables the study of generalizability in Arabic LLM-generated text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze, and discuss benchmark experiments on the new dataset, identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are, however, not always consistent: models struggle to generalize when they face unseen patterns in cross-genre settings. These challenges are particularly prominent for news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research on Arabic LLM-generated text detection and on mitigating the risks of misinformation, academic dishonesty, and cyber threats.
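The cross-genre setting described in the abstract, where a detector is trained on one genre and tested on another, can be sketched as follows. This is only an illustration of the evaluation protocol: the tiny Naive Bayes classifier, the `text` and `label` field names, and the English placeholder sentences are all assumptions, standing in for the paper's actual models and Arabic data.

```python
import math
from collections import Counter

def train_nb(samples):
    """Fit a toy multinomial Naive Bayes model on whitespace tokens."""
    counts = {"human": Counter(), "llm": Counter()}
    for s in samples:
        counts[s["label"]].update(s["text"].split())
    vocab = set(counts["human"]) | set(counts["llm"])
    return counts, vocab

def predict_nb(model, text):
    """Return the label with the higher add-one-smoothed log-likelihood.
    Class priors are omitted because the classes are assumed balanced."""
    counts, vocab = model
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab) + 1  # +1 bucket for unseen tokens
        scores[label] = sum(math.log((c[t] + 1) / total) for t in text.split())
    return max(scores, key=scores.get)

def cross_genre_matrix(train_by_genre, test_by_genre):
    """Accuracy for every (training genre, test genre) pair."""
    matrix = {}
    for src, train in train_by_genre.items():
        model = train_nb(train)
        for tgt, test in test_by_genre.items():
            correct = sum(predict_nb(model, s["text"]) == s["label"] for s in test)
            matrix[(src, tgt)] = correct / len(test)
    return matrix

# Placeholder data; a real run would use separate train and test splits.
toy = {
    "news": [
        {"text": "officials said the report was late", "label": "human"},
        {"text": "in conclusion the report highlights key points", "label": "llm"},
    ],
    "social_media": [
        {"text": "lol cant believe this happened", "label": "human"},
        {"text": "it is important to note this event", "label": "llm"},
    ],
    "reviews": [
        {"text": "food was cold never again", "label": "human"},
        {"text": "overall the restaurant offers a pleasant experience", "label": "llm"},
    ],
}
matrix = cross_genre_matrix(toy, toy)
for (src, tgt), acc in sorted(matrix.items()):
    print(f"train={src:13s} test={tgt:13s} accuracy={acc:.2f}")
```

The diagonal entries of the matrix correspond to in-genre evaluation and the off-diagonal entries to cross-genre evaluation, so a gap between the two is one way to quantify the generalization problem the abstract reports.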

Problem

Research questions and friction points this paper is trying to address.

Detecting LLM-generated Arabic texts across multiple genres and dialects

Addressing generalization challenges in cross-genre Arabic text detection

Mitigating misinformation risks from Arabic LLM-generated content

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Arabic dataset for LLM detection

Multi-genre coverage with MSA and dialects

Fine-tuned BERT models outperform LLM-based detection
