🤖 AI Summary
This study investigates how well large language models (LLMs) handle diacritization, a challenging morphophonological task, in two typologically distinct languages: Arabic and Yoruba, the latter low-resource. It introduces MultiDiac, a multilingual diacritization benchmark covering both languages, and uses it to systematically evaluate 14 general-purpose LLMs against 6 task-specific diacritization models. Four small open-source LLMs are additionally fine-tuned for Yoruba using LoRA. To the authors' knowledge, this is the first controlled cross-lingual comparison of LLMs and specialized models on diacritization for both languages. Results show that many off-the-shelf LLMs outperform the dedicated models, while smaller models are prone to hallucination; LoRA fine-tuning reduces error rates by up to 37% and substantially lowers hallucination rates. Key contributions include: (1) the publicly released MultiDiac dataset; (2) a cross-lingual evaluation framework spanning Arabic and Yoruba; and (3) empirical evidence that lightweight adaptation mitigates hallucination in low-resource diacritization.
📝 Abstract
We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset, MultiDiac, containing diverse samples that capture a range of diacritic ambiguities. We evaluate 14 LLMs varying in size, accessibility, and language coverage, and benchmark them against 6 specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. Fine-tuning on a small dataset improves diacritization performance and reduces hallucination rates.
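The abstract reports diacritization error rates and hallucination rates without defining them here. As an illustration only (the paper's exact metric definitions are not given in this excerpt), the sketch below computes a simple character-level diacritic error rate: it decomposes each string with Unicode NFD normalization, pairs each base character with its combining marks, and counts positions where the predicted marks differ from the reference. Predictions whose base characters differ from the reference are flagged, which is one plausible way to operationalize "hallucination" in this task.

```python
import unicodedata

def diacritic_error_rate(reference: str, prediction: str) -> float:
    """Illustrative diacritic error rate: fraction of base characters
    whose attached combining marks differ from the reference.

    Raises ValueError if the base characters themselves differ, i.e.
    the model altered the underlying text (a hallucination, under
    this sketch's assumption)."""
    def split(text):
        # NFD separates base characters from their combining diacritics
        out = []
        for ch in unicodedata.normalize("NFD", text):
            if unicodedata.combining(ch):
                if out:
                    out[-1][1].add(ch)
            else:
                out.append([ch, set()])
        return out

    ref, hyp = split(reference), split(prediction)
    if len(ref) != len(hyp) or any(r[0] != h[0] for r, h in zip(ref, hyp)):
        raise ValueError("base characters differ from reference")
    errors = sum(1 for r, h in zip(ref, hyp) if r[1] != h[1])
    return errors / max(len(ref), 1)

# Yoruba example: reference "bàtà" (shoe) vs. an undiacritized prediction.
# Two of four base characters are missing their grave accents.
print(diacritic_error_rate("bàtà", "bata"))  # 0.5
```

This is only a minimal word-level sketch; a full evaluation would aggregate over a corpus and handle Arabic's in-line harakat marks, which NFD likewise exposes as combining characters.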