YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

📅 2024-12-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the automatic diacritization challenge for Yoruba—a low-resource tonal language—by introducing YAD, the first dedicated benchmark dataset for Yoruba diacritization, and Yoruba-T5, the first monolingual T5 model pretrained specifically for Yoruba. Methodologically, we adopt the T5 text-to-text framework, performing monolingual pretraining on Yoruba corpora followed by supervised fine-tuning on YAD, with systematic evaluation of diacritization performance. Our key contributions are: (1) releasing YAD—a high-quality, human-verified diacritization benchmark; (2) open-sourcing the first monolingual T5 model designed explicitly for tonal language diacritization; and (3) empirically demonstrating that both model capacity and training data scale exert significant positive effects on diacritization accuracy. Experiments show that Yoruba-T5 substantially outperforms the multilingual mT5 baseline on YAD, establishing a reproducible, language-specific paradigm for phonological normalization in low-resource tonal languages.
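The text-to-text framing described above can be illustrated with a minimal sketch of how supervised fine-tuning pairs might be constructed: gold diacritized Yorùbá text is stripped of its tone and underdot marks to form the model input, and the original text serves as the target. The `strip_diacritics` helper and the `diacritize:` task prefix are illustrative assumptions, not details taken from the paper.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop combining marks (tone accents,
    # underdots), then recompose. "Ọjọ́" becomes "Ojo".
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

def make_t5_pair(diacritized: str, prefix: str = "diacritize: "):
    # T5-style (input, target) pair: the input is the undiacritized text
    # with a task prefix; the target is the fully diacritized text.
    return prefix + strip_diacritics(diacritized), diacritized

# Example: build a training pair from a gold diacritized sentence.
src, tgt = make_t5_pair("Ọjọ́ dára")
# src == "diacritize: Ojo dara", tgt == "Ọjọ́ dára"
```

At inference time the fine-tuned model would receive only the prefixed undiacritized text and generate the diacritized form, so the benchmark reduces to comparing generated output against the gold target.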


📝 Abstract
In this work, we present the Yorùbá Automatic Diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that it outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models yield better diacritization for Yorùbá.
Problem

Research questions and friction points this paper is trying to address.

Automatic Phonetics
Yoruba Language
Pronunciation Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

T5 Model
Yoruba Language
Automatic Accentuation