🤖 AI Summary
To address the performance limitations of pretrained language models for low-resource languages, this work applies the data-efficient BabyLM paradigm to isiXhosa, a language hindered by severe scarcity of training data. Using a small monolingual corpus (<1 GB), the authors pretrain two lightweight architectures, ELC-BERT and MLSM, then fine-tune and evaluate them on part-of-speech (POS) tagging and named entity recognition (NER). Results show a +3.2-point improvement in NER F1, with certain configurations outperforming the multilingual XLM-R. The study finds that architectural design critically influences representation learning in low-resource settings, empirically validates the practical viability of BabyLM approaches for under-resourced languages, and underscores the acute shortage of high-quality monolingual pretraining data for isiXhosa.
📝 Abstract
The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to in development (<100m). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to much less than 100m words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.