NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

📅 2026-03-14
🤖 AI Summary
This study addresses the longstanding scarcity of high-quality parallel corpora for low-resource South Asian languages, specifically Nepali and Tamang, which has hindered machine translation progress. We present the first Nepali–Tamang parallel corpus spanning five domains (Agriculture, Health, Education and Technology, Culture, and General Communication), comprising 20K human-translated and 80K synthetically generated sentence pairs refined through semantic filtering. An expert validation protocol was implemented to ensure data quality. Leveraging this resource, we fine-tune the multilingual pre-trained models mBART, M2M-100, and NLLB-200 alongside a vanilla Transformer, with NLLB-200 achieving the best performance: sacreBLEU scores of 40.92 for Nepali→Tamang and 45.26 for Tamang→Nepali, substantially narrowing the data gap for this language pair.
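The sacreBLEU scores reported above build on the BLEU metric. Purely as a rough illustration (function names and the zero-count shortcut are ours, not the paper's; sacreBLEU additionally standardizes tokenization, applies smoothing, and reports a reproducibility signature), a minimal BLEU sketch looks like:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Toy BLEU: clipped n-gram precisions up to max_n, their geometric
    mean, and a brevity penalty. No smoothing or standardized
    tokenization, which is what sacreBLEU adds on top."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # toy shortcut; real BLEU smooths zero counts
        precisions.append(overlap / total)
    bp = 1.0 if len(hyp) >= len(ref) else exp(1 - len(ref) / len(hyp))
    return 100 * bp * exp(sum(log(p) for p in precisions) / max_n)
```

A perfect match scores 100; a correct but short hypothesis is discounted by the brevity penalty, e.g. `bleu("the cat sat on", "the cat sat on the mat")` lands around 61.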

📝 Abstract
Modern translation systems rely heavily on large, high-quality parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in NepTam20K), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out with several multilingual pre-trained models (mBART, M2M-100, and NLLB-200) and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
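The semantic filtering step in the pipeline can be illustrated as: score each candidate sentence pair with a similarity function and keep only pairs above a threshold. The sketch below is dependency-free and purely illustrative; character n-gram cosine similarity stands in for the multilingual sentence embeddings a production filter would use, and the function names and threshold are our assumptions, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-gram counts: a crude stand-in for the multilingual
    sentence embeddings a real semantic filter would use."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_filter(pairs, threshold=0.1):
    """Keep (source, candidate) pairs whose similarity clears the
    threshold, discarding noisy or misaligned candidates."""
    return [(s, t) for s, t in pairs
            if cosine(char_ngrams(s), char_ngrams(t)) >= threshold]
```

For example, `semantic_filter([("hello world", "hello world!"), ("hello world", "zzzz")])` keeps only the first pair, since the second shares no character trigrams with its source.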
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
Nepali-Tamang translation
parallel corpus scarcity
machine translation
South Asian languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel corpus
low-resource machine translation
Nepali-Tamang
synthetic data
multilingual pre-trained models
Rupak Raj Ghimire
ILPRL, Kathmandu University, Nepal
Bipesh Subedi
ILPRL, Kathmandu University, Nepal
Balaram Prasain
Tribhuvan University, Nepal
Prakash Poudyal
ILPRL, Kathmandu University, Nepal
Praveen Acharya
Dublin City University, Ireland
Nischal Karki
ILPRL, Kathmandu University, Nepal
Rupak Tiwari
ILPRL, Kathmandu University, Nepal
Rishikesh Kumar Sharma
ILPRL, Kathmandu University, Nepal
Jenny Poudel
ILPRL, Kathmandu University, Nepal
Bal Krishna Bal
Professor of Computer Engineering, Kathmandu University
Natural Language Processing · Sentiment Analysis · Software Localization