Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

📅 2025-05-07
🤖 AI Summary
This paper addresses data scarcity in generative language modelling for low-resource languages (LRLs) and the resulting exacerbation of linguistic inequity. We systematically review 54 studies and propose the first taxonomy of data augmentation techniques tailored to LRL generative modelling, covering monolingual augmentation, back-translation, multilingual pretraining, prompt engineering, and cross-lingual transfer, integrated with linguistic genealogy and evaluation trend analysis. Our analysis identifies three critical bottlenecks: (1) methodological concentration on a few models (e.g., mT5, XGLM); (2) severe imbalance in language coverage, with only 12 LRLs dominating the literature; and (3) absence of standardised, comparable evaluation protocols. The primary contribution is a set of empirically grounded, scalable, and equitable improvement pathways, balancing technical feasibility with linguistic inclusivity, to advance fair and robust generative AI for underrepresented languages.

📝 Abstract
Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRLs). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and preserving linguistic diversity in a world increasingly shaped by large-scale language technologies.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in generative language models for low-resource languages
Evaluating technical approaches to improve generative tasks for underrepresented languages
Identifying challenges in building equitable AI tools for linguistic diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monolingual data augmentation for low-resource languages
Back-translation techniques to enhance data availability
Multilingual training with transformer-based architectures
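Of these, back-translation is perhaps the easiest to illustrate concretely: target-side monolingual LRL text is round-tripped through a pivot language (typically English) so that the noisy round-trip output can serve as a synthetic source paired with the clean original. The sketch below is a toy illustration only, using word-lookup dictionaries as stand-ins for the neural translation models a real pipeline would use; the Irish-flavoured example words and all function names are illustrative assumptions, not taken from the reviewed paper.

```python
# Toy back-translation sketch. Real systems replace the dictionary
# lookups with trained NMT models in both directions; the lexicons and
# example sentence here are purely illustrative.

# Hypothetical LRL->English and English->LRL word maps. The reverse map
# is deliberately imperfect, so the round trip yields a noisy paraphrase.
LRL_TO_EN = {"rith": "runs", "an": "the", "madra": "dog"}
EN_TO_LRL = {"runs": "ritheann", "the": "an", "dog": "madra"}

def translate(sentence: str, lexicon: dict) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(lexicon.get(word, word) for word in sentence.split())

def back_translate(monolingual_lrl: list) -> list:
    """Round-trip each clean LRL sentence through English, pairing the
    noisy synthetic output (source) with the original text (target)."""
    pairs = []
    for target in monolingual_lrl:
        pivot = translate(target, LRL_TO_EN)      # LRL -> English
        synthetic = translate(pivot, EN_TO_LRL)   # English -> LRL
        pairs.append((synthetic, target))
    return pairs

corpus = ["rith an madra"]
augmented = back_translate(corpus)
# Each pair is (synthetic source, clean target), ready to be added
# to the parallel training data.
```

The key property the sketch preserves is directionality: the clean monolingual text always ends up on the target side, so translation noise only contaminates the source inputs the model learns to map from.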
Josh McGiff
Department of Computer Science and Information Systems, University of Limerick, Ireland

Nikola S. Nikolov
Associate Professor, Department of Computer Science and Information Systems, University of Limerick
Machine Learning · NLP · Graph Drawing