Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses automatic diacritic restoration (including tone marks), a critical text normalization task, for two extremely low-resource indigenous languages: Bribri (Costa Rica) and Cook Islands Māori (Cook Islands). The authors fine-tune large language models (LLMs) at the character level, using UTF-8 byte-level tokenization to capture orthographic detail, and compare systematically against baseline algorithms and multilingual LLMs. Their results, the first reported for these languages, show that roughly 10,000 annotated words are enough to achieve stable, high performance; zero-shot methods perform poorly throughout, and character-level fine-tuning consistently outperforms general-purpose multilingual LLMs on the harder diacritic restoration cases. The work answers a direct request from the language communities for robust, accessible text standardization tools.
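To make the byte-level idea concrete, here is a minimal Python sketch (ours, not from the paper) showing how diacritic-bearing characters of the kind used in these orthographies decompose into UTF-8 bytes and combining marks, the units a byte-level tokenizer actually operates on:

```python
import unicodedata

# Illustrative characters: a plain vowel, tone-marked vowels of the kind
# Bribri orthography uses (acute, grave), a macron vowel of the kind
# Cook Islands Māori uses for vowel length, and a diaeresis.
for ch in ["a", "á", "à", "ā", "ë"]:
    utf8_bytes = list(ch.encode("utf-8"))
    nfd = unicodedata.normalize("NFD", ch)  # split base letter + combining mark
    print(f"{ch!r}: UTF-8 bytes {utf8_bytes}, NFD codepoints {[hex(ord(c)) for c in nfd]}")

# 'a': UTF-8 bytes [97]        -> NFD ['0x61']
# 'á': UTF-8 bytes [195, 161]  -> NFD ['0x61', '0x301']
# 'ā': UTF-8 bytes [196, 129]  -> NFD ['0x61', '0x304']
```

A subword tokenizer trained on high-resource languages may treat each of these as a rare, opaque token, whereas at the byte level the shared base letter and the distinguishing mark are exposed separately.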

📝 Abstract
We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritic restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
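The abstract does not spell out how training pairs are built; a standard way to create them for this task, sketched here under that assumption, is to strip diacritics from attested text with Unicode normalization and keep the original as the target:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks, macrons, diaereses).

    NFD decomposition separates base letters from combining marks
    (Unicode category Mn); dropping the marks and recomposing yields
    the undiacritized input side of a training pair.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)

# The diacritized target is the attested text; the stripped form is the input.
target = "Cook Islands Māori"
source = strip_diacritics(target)
print(source, "->", target)  # Cook Islands Maori -> Cook Islands Māori
```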
Problem

Research questions and friction points this paper is trying to address.

Develops diacritic restoration for under-resourced indigenous languages
Compares algorithms and data needs for text normalization tasks
Evaluates model performance in low-resource linguistic contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes character-level LLMs for diacritic restoration (sketched below)
Uses UTF-8 byte representations to decompose complex characters
Shows that around 10,000 words suffice for reliable performance
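The summary does not name the exact model; as a minimal sketch of the byte-level fine-tuning idea, assuming ByT5 (google/byt5-small) as the character/byte-level seq2seq model and the Hugging Face transformers library, one training step might look like this:

```python
# Hypothetical sketch, not the paper's exact setup: one fine-tuning step of a
# byte-level seq2seq model mapping undiacritized text to diacritized text.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative LR

# One illustrative (input, target) pair; real training would loop over the
# roughly 10,000-word annotated budget the paper identifies as sufficient.
batch = tokenizer(["Kia orana"], return_tensors="pt")
labels = tokenizer(["Kia orāna"], return_tensors="pt").input_ids

model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over output bytes
loss.backward()
optimizer.step()
print(float(loss))
```

Because ByT5 tokenizes directly into UTF-8 bytes, no vocabulary needs to be adapted for the Bribri or Cook Islands Māori orthographies before fine-tuning.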
Rolando Coto-Solano
Dartmouth College
Daisy Li
Dartmouth College
Manoela Teleginski Ferraz
Dartmouth College
Olivia Sasse
Dartmouth College
Cha Krupka
Dartmouth College
Sharid Loáiciga
University of Gothenburg
Sally Akevai Tenamu Nicholas
The University of Auckland (Waipapa Taumata Rau)