Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

📅 2026-03-26

🤖 AI Summary

This work addresses a failure mode in low-resource machine translation for Romansh: when large language models are used to generate synthetic data from higher-resource languages, they tend to confuse the six distinct Romansh language varieties. The authors show that the direction of data augmentation should instead be aligned with the resource gradient between source and target language. On the lowest-resource variety, this approach surpasses Gemini 3 Pro by 23 BLEU. A human evaluation confirms that the resulting model is the first to generate fluent translations in the individual Romansh varieties.

📝 Abstract

Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
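The idea of aligning the augmentation direction with the resource gradient can be illustrated with a minimal sketch. It rests on one possible reading of the abstract, not on the authors' implementation: authentic text is kept on the lower-resource side, and the LLM is only asked to produce the higher-resource side of each training pair. The names `Language`, `resource_level`, `augmentation_direction`, `build_synthetic_pairs`, and the `translate` callable are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Language:
    name: str
    resource_level: int  # hypothetical score; larger means more available data


def augmentation_direction(a: Language, b: Language) -> Tuple[Language, Language]:
    """Return (authentic_side, synthetic_side) for a language pair.

    Assumption for this sketch: the lower-resource language supplies
    authentic text, and the higher-resource side is machine-generated.
    """
    if a.resource_level == b.resource_level:
        raise ValueError("no resource gradient between the two languages")
    low, high = sorted((a, b), key=lambda lang: lang.resource_level)
    return low, high


def build_synthetic_pairs(
    authentic_sentences: List[str],
    authentic: Language,
    synthetic: Language,
    translate: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Pair each authentic sentence with a machine-generated counterpart.

    `translate(text, src_name, tgt_name)` stands in for any LLM call;
    it is a placeholder, not a real library API.
    """
    return [
        {
            authentic.name: sentence,
            synthetic.name: translate(sentence, authentic.name, synthetic.name),
        }
        for sentence in authentic_sentences
    ]
```

Under this reading, the resulting pairs could train a model in either direction while the Romansh side stays human-written, so the LLM's tendency to mix varieties would not leak into the target text. Whether this matches the paper's exact setup is not stated in the abstract.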

Problem

Research questions and friction points this paper is trying to address.

low-resource machine translation

Romansh language varieties

translation asymmetry

synthetic data generation

language confusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

translation asymmetry

data augmentation

low-resource machine translation