Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation

📅 2026-01-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of low-resource machine translation, where performance is hindered by the scarcity of high-quality parallel corpora. To this end, the authors propose LALITA, a novel framework that systematically leverages lexical and linguistic features of source sentences—such as syntactic complexity—to identify and select high-value training samples. Combined with synthetic data augmentation, this approach significantly reduces data requirements while improving translation quality. The method is evaluated across multiple low-resource languages, including Hindi, Odia, Nepali, Nynorsk, and German, consistently yielding performance gains across training sets ranging from 50K to 800K sentence pairs. Notably, LALITA achieves these improvements with over 50% less training data, demonstrating both its efficiency and strong generalization capability.

📝 Abstract
Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. But for low-resource languages, human translation to generate sufficient data is prohibitively expensive. Therefore, it is crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, LALITA (Lexical And Linguistically Informed Text Analysis), targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. We test this by simulating low-resource data availability with curated datasets of 50K to 800K English sentences and report improved performance at all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT system training cost by reducing the training data requirement, but also showcases LALITA's utility in data augmentation.
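The abstract describes ranking source sentences by lexical and linguistic complexity and training mostly on the complex ones. The paper does not publish LALITA's exact feature set, so the following is only an illustrative sketch using generic lexical proxies (sentence length, type-token ratio, rare-token ratio); the weights and the `keep_ratio` parameter are arbitrary assumptions, not the authors' method.

```python
# Toy complexity-based source sentence selector (illustrative only).
# All features and weights below are assumptions for demonstration,
# not the actual LALITA scoring function.
from collections import Counter

def complexity_score(sentence, token_freq):
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    length = len(tokens)                  # longer sentences score higher
    diversity = len(set(tokens)) / length # type-token ratio
    # fraction of tokens that are rare in the whole corpus
    rare = sum(1 for t in tokens if token_freq[t] <= 2) / length
    return 0.2 * length + diversity + rare

def select_complex(sentences, keep_ratio=0.5):
    """Keep the most 'complex' fraction of source sentences."""
    token_freq = Counter(t for s in sentences for t in s.lower().split())
    ranked = sorted(sentences,
                    key=lambda s: complexity_score(s, token_freq),
                    reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

corpus = [
    "the cat sat",
    "the cat sat on the mat",
    "quantum entanglement defies classical locality assumptions entirely",
    "hello world",
]
selected = select_complex(corpus, keep_ratio=0.5)
```

In a real pipeline the retained source sentences would then be paired with human or synthetic translations to form the curated parallel corpus.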
Problem

Research questions and friction points this paper is trying to address.

low-resource machine translation
parallel corpus
data curation
source sentence selection
training data efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

data curation
low-resource machine translation
sentence selection
parallel corpus
LALITA
Saumitra Yadav
Language Technologies Research Centre (LTRC), International Institute of Information Technology, Hyderabad, 500032, Telangana, India.
Manish Shrivastava
International Institute of Information Technology Hyderabad
Natural Language Processing
Machine Learning
Machine Translation
Cross Lingual IR
Multilingual Question Answering