Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

2026-03-26
AI Summary
This work addresses a failure mode in low-resource machine translation for Romansh: when large language models are used to generate synthetic training data from higher-resource languages, they tend to confuse the six distinct Romansh language varieties. The authors instead align the direction of data augmentation with the resource gradient between source and target language. With this approach, their system surpasses Gemini 3 Pro by 23 BLEU in the lowest-resource variety of Romansh. A human evaluation confirms that the experiments yield the first model that generates fluent translations in the individual Romansh varieties.

Abstract
Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
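
The idea of choosing the augmentation direction from the resource gradient can be illustrated with a minimal sketch. It rests on an assumption that the abstract does not spell out: that "aligned with the resource gradient" means the LLM reads authentic text in the lower-resource variety and writes the higher-resource language, so it never has to generate text in a variety it might confuse with another. The function names, the language labels, the corpus sizes, and the `translate` callable are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature for an LLM call: (text, source_lang, target_lang) -> translation
Translate = Callable[[str, str, str], str]


def augmentation_direction(resources: Dict[str, int], a: str, b: str) -> Tuple[str, str]:
    """Return (llm_source, llm_target) for a language pair.

    Assumption: the LLM translates from the lower-resource language
    into the higher-resource one. `resources` maps a language label
    to any rough proxy of its resource level (e.g. corpus size).
    """
    low, high = sorted((a, b), key=lambda lang: resources[lang])
    return low, high


def build_synthetic_pairs(
    monolingual: List[str],
    llm_source: str,
    llm_target: str,
    translate: Translate,
) -> List[Dict[str, str]]:
    """Pair each authentic sentence with its synthetic LLM translation."""
    pairs = []
    for sentence in monolingual:
        synthetic = translate(sentence, llm_source, llm_target)
        # The authentic side stays untouched; only the other side is synthetic.
        pairs.append({llm_source: sentence, llm_target: synthetic})
    return pairs


if __name__ == "__main__":
    # Illustrative sizes only, not figures from the paper.
    sizes = {"high_resource_lang": 1_000_000, "romansh_variety": 5_000}
    src, tgt = augmentation_direction(sizes, "high_resource_lang", "romansh_variety")

    # Stub standing in for a real LLM call.
    def fake_llm(text: str, source: str, target: str) -> str:
        return f"[{source}->{target}] {text}"

    print(build_synthetic_pairs(["authentic sentence"], src, tgt, fake_llm))
```

Under this reading, the resulting pairs can train a model in either translation direction, since the low-resource side of every pair is authentic text rather than LLM output.
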
Problem

Research questions and friction points this paper is trying to address.

low-resource machine translation
Romansh language varieties
translation asymmetry
synthetic data generation
language confusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

translation asymmetry
data augmentation
low-resource machine translation
Romansh language varieties
LLM-based synthetic data

Authors

Jannis Vamvas, University of Zurich
Ignacio Pérez Prat, Lia Rumantscha
Angela Heldstab, University of Zurich
Dominic P. Fischer, University of Zurich
Sina Ahmadi, University of Zurich (Natural Language Processing, Computational Linguistics)
Rico Sennrich, University of Zurich