🤖 AI Summary
This study investigates how well large language models (LLMs) handle diacritization, a challenging morphophonological task, in two typologically distinct languages: Arabic and Yoruba, the latter low-resource. It introduces MultiDiac, a multilingual diacritization benchmark covering both languages, and uses it to systematically evaluate 14 general-purpose LLMs against 6 task-specific diacritization models. Four small open-source LLMs are additionally fine-tuned for Yoruba using LoRA. To the authors' knowledge, this is the first controlled cross-lingual comparison of LLMs and specialized models on diacritization for both languages. Results show that many off-the-shelf LLMs outperform the dedicated models, while smaller models are prone to hallucination; LoRA fine-tuning reduces error rates by up to 37% and substantially lowers hallucination rates. Key contributions include: (1) the publicly released MultiDiac dataset; (2) a cross-lingual evaluation framework spanning Arabic and Yoruba; and (3) empirical evidence that lightweight adaptation mitigates hallucination in low-resource diacritization.
📝 Abstract
We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset, MultiDiac, containing diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset improves diacritization performance and reduces hallucination rates.
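The abstract reports diacritization error rates and hallucination rates without defining them here. As an illustration only (the paper's exact metric definitions are not given in this excerpt), the sketch below computes a simple character-level diacritic error rate: it decomposes each string with Unicode NFD normalization, pairs each base character with its combining marks, and counts positions where the predicted marks differ from the reference. Predictions whose base characters differ from the reference are flagged, which is one plausible way to operationalize "hallucination" in this task.

```python
import unicodedata

def diacritic_error_rate(reference: str, prediction: str) -> float:
    """Illustrative diacritic error rate: fraction of base characters
    whose attached combining marks differ from the reference.

    Raises ValueError if the base characters themselves differ, i.e.
    the model altered the underlying text (a hallucination, under
    this sketch's assumption)."""
    def split(text):
        # NFD separates base characters from their combining diacritics
        out = []
        for ch in unicodedata.normalize("NFD", text):
            if unicodedata.combining(ch):
                if out:
                    out[-1][1].add(ch)
            else:
                out.append([ch, set()])
        return out

    ref, hyp = split(reference), split(prediction)
    if len(ref) != len(hyp) or any(r[0] != h[0] for r, h in zip(ref, hyp)):
        raise ValueError("base characters differ from reference")
    errors = sum(1 for r, h in zip(ref, hyp) if r[1] != h[1])
    return errors / max(len(ref), 1)

# Yoruba example: reference "bàtà" (shoe) vs. an undiacritized prediction.
# Two of four base characters are missing their grave accents.
print(diacritic_error_rate("bàtà", "bata"))  # 0.5
```

This is only a minimal word-level sketch; a full evaluation would aggregate over a corpus and handle Arabic's in-line harakat marks, which NFD likewise exposes as combining characters.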