NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

📅 2026-03-14
🤖 AI Summary
This study addresses the longstanding scarcity of high-quality parallel corpora for low-resource South Asian languages, specifically Nepali and Tamang, which has hindered machine translation progress. We present the first Nepali–Tamang parallel corpus spanning five domains (Agriculture, Health, Education and Technology, Culture, and General Communication), comprising 20K human-translated and 80K synthetically generated sentence pairs refined through semantic filtering. An expert validation protocol was implemented to ensure data quality. Leveraging this resource, we fine-tune the multilingual pre-trained models mBART, M2M-100, and NLLB-200 alongside a vanilla Transformer, with NLLB-200 achieving the best performance: sacreBLEU scores of 40.92 for Nepali→Tamang and 45.26 for Tamang→Nepali, substantially narrowing the data gap for this language pair.
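The sacreBLEU scores reported above build on the BLEU metric. Purely as a rough illustration (function names and the zero-count shortcut are ours, not the paper's; sacreBLEU additionally standardizes tokenization, applies smoothing, and reports a reproducibility signature), a minimal BLEU sketch looks like:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Toy BLEU: clipped n-gram precisions up to max_n, their geometric
    mean, and a brevity penalty. No smoothing or standardized
    tokenization, which is what sacreBLEU adds on top."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # toy shortcut; real BLEU smooths zero counts
        precisions.append(overlap / total)
    bp = 1.0 if len(hyp) >= len(ref) else exp(1 - len(ref) / len(hyp))
    return 100 * bp * exp(sum(log(p) for p in precisions) / max_n)
```

A perfect match scores 100; a correct but short hypothesis is discounted by the brevity penalty, e.g. `bleu("the cat sat on", "the cat sat on the mat")` lands around 61.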

📝 Abstract
Modern translation systems rely heavily on large, high-quality parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in NepTam20K), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out with several multilingual pre-trained models (mBART, M2M-100, and NLLB-200) and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
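The semantic filtering step in the pipeline can be illustrated as: score each candidate sentence pair with a similarity function and keep only pairs above a threshold. The sketch below is dependency-free and purely illustrative; character n-gram cosine similarity stands in for the multilingual sentence embeddings a production filter would use, and the function names and threshold are our assumptions, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-gram counts: a crude stand-in for the multilingual
    sentence embeddings a real semantic filter would use."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_filter(pairs, threshold=0.1):
    """Keep (source, candidate) pairs whose similarity clears the
    threshold, discarding noisy or misaligned candidates."""
    return [(s, t) for s, t in pairs
            if cosine(char_ngrams(s), char_ngrams(t)) >= threshold]
```

For example, `semantic_filter([("hello world", "hello world!"), ("hello world", "zzzz")])` keeps only the first pair, since the second shares no character trigrams with its source.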
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
Nepali-Tamang translation
parallel corpus scarcity
machine translation
South Asian languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel corpus
low-resource machine translation
Nepali-Tamang
synthetic data
multilingual pre-trained models
Rupak Raj Ghimire
ILPRL, Kathmandu University, Nepal
Bipesh Subedi
ILPRL, Kathmandu University, Nepal
Balaram Prasain
Tribhuvan University, Nepal
Prakash Poudyal
ILPRL, Kathmandu University, Nepal
Praveen Acharya
Dublin City University, Ireland
Nischal Karki
ILPRL, Kathmandu University, Nepal
Rupak Tiwari
ILPRL, Kathmandu University, Nepal
Rishikesh Kumar Sharma
ILPRL, Kathmandu University, Nepal
Jenny Poudel
ILPRL, Kathmandu University, Nepal
Bal Krishna Bal
Professor of Computer Engineering, Kathmandu University
Natural Language Processing · Sentiment Analysis · Software Localization