LLMs for Extremely Low-Resource Finno-Ugric Languages

📅 2024-10-24

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) remain critically underdeveloped for ultra-low-resource Uralic languages—such as Võro, Livonian, and Komi—due to severe data scarcity and lack of standardized evaluation. Method: We establish an end-to-end technical pipeline encompassing data curation, multilingual pretraining, instruction fine-tuning, and multidimensional evaluation. Key innovations include cross-lingual data augmentation, quality-aware filtering, and a human-in-the-loop evaluation framework to enhance cultural appropriateness and grammatical accuracy. Contribution/Results: We introduce smugri-MT-bench—the first multi-turn dialogue benchmark for the Uralic language family—and publicly release the first Uralic-adapted multilingual foundation model and its instruction-tuned variants. Our open-source paradigm covers datasets, models, automated metrics, and human evaluations. We release multiple parameter-scale models alongside an authoritative benchmark dataset containing 3,000+ samples, achieving substantial gains over zero-shot transfer baselines.

📝 Abstract

The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.

Problem

Research questions and friction points this paper is trying to address.

Addressing underrepresentation of Finno-Ugric languages in LLMs

Developing LLMs for Võro, Livonian, and Komi

Creating benchmarks and models for low-resource language NLP

Innovation

Methods, ideas, or system contributions that make the work stand out.

Development of multilingual base and instruction-tuned models

Creation of evaluation benchmarks including smugri-MT-bench

Human evaluation of model outputs in the target languages

🔎 Similar Papers

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis