🤖 AI Summary
This study investigates the root causes of performance disparities when fine-tuning machine translation models for extremely low-resource Indigenous languages, asking whether deep structural differences between languages limit how well pretrained models can adapt. Method: Using two related but structurally divergent Brazilian Indigenous languages, Tupinambá (agglutinative, SOV) and Kaingang (fusional, VOS), the authors construct small bilingual corpora and run controlled translation experiments in both directions, systematically varying the data-cleaning protocol, the size of the base model, the size of the training dataset, and the pretraining foundation. Contribution/Results: Morphological, syntactic, and word-order differences emerge as the primary driver of variance in fine-tuning performance, while data quality, model capacity, and corpus size have little or no influence. This points to an inherent limitation of the prevailing fine-tuning paradigm when adapting to structurally dissimilar languages and offers theoretical insight and practical guidance for designing robust low-resource translation systems.
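The controlled design described above amounts to a factorial sweep over the training factors. A minimal sketch of such a grid is shown below; the factor names and values (`cleaning`, `base_model`, `train_size`, the size levels, and the `fine_tune_and_evaluate` driver) are invented placeholders, not the paper's actual settings.

```python
# Hedged sketch of a controlled experimental grid over the training factors
# the summary names. All factor values are illustrative placeholders.
from itertools import product

factors = {
    "cleaning":   ["minimal", "aggressive"],            # data-cleaning protocol
    "base_model": ["small-ckpt", "large-ckpt"],         # pretrained foundation / size
    "train_size": [500, 2000],                          # number of sentence pairs
    "direction":  ["pt->indigenous", "indigenous->pt"], # both translation directions
}

# Dicts preserve insertion order (Python 3.7+), so unpacking matches the keys.
for cleaning, base_model, train_size, direction in product(*factors.values()):
    run_id = f"{cleaning}_{base_model}_{train_size}_{direction}"
    # fine_tune_and_evaluate(run_id, ...)  # hypothetical driver; see the
    print(run_id)                          # fine-tuning sketch after the abstract
```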
📝 Abstract
Fine-tuning pre-trained language models with small amounts of data is a commonly used method to create translators for ultra-low-resource languages such as endangered Indigenous languages. However, previous works have reported substantially different performance from translators created with similar methodology and data. In this work, we systematically explored possible causes of this performance difference, aiming to determine whether it was a product of different cleaning procedures, limitations of the pre-trained models, the size of the base model, or the size of the training dataset, studying both directions of translation. Our studies, using two Brazilian Indigenous languages that are related but have significantly different structural linguistic characteristics, indicated little or no influence from these training factors, suggesting that differences between languages may play a significant role in the ability to produce translators by fine-tuning pre-trained models.
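The recipe the abstract describes, adapting a pretrained multilingual MT model with a tiny parallel corpus, can be sketched as follows. The checkpoint, language codes, data, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of fine-tuning a pretrained MT model on a small bilingual
# corpus. Checkpoint, language codes, and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CKPT = "facebook/nllb-200-distilled-600M"  # assumed base model; the paper's may differ
# NLLB has no codes for these Indigenous languages, so a related language's
# code (here Guarani) is reused as a placeholder target tag, a common
# low-resource workaround.
tokenizer = AutoTokenizer.from_pretrained(CKPT, src_lang="por_Latn", tgt_lang="grn_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

# Tiny parallel corpus of (source, target) sentence pairs; in the study these
# would come from the curated bilingual corpora.
pairs = [
    ("Bom dia.", "<target sentence 1>"),
    ("Como você está?", "<target sentence 2>"),
]

def collate(batch):
    src, tgt = zip(*batch)
    enc = tokenizer(list(src), text_target=list(tgt),
                    padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    # Ignore padding positions when computing the loss.
    enc["labels"][enc["labels"] == tokenizer.pad_token_id] = -100
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a handful of epochs; tiny datasets overfit quickly
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Swapping the source and target sides of `pairs` (and the `src_lang`/`tgt_lang` tags) gives the opposite translation direction studied in the paper.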