AI Summary
This work addresses a failure mode in low-resource machine translation for Romansh: when large language models (LLMs) generate synthetic training data from higher-resource languages, they tend to confuse the six distinct Romansh varieties, which degrades translation performance. The authors instead align the direction of data augmentation with the resource gradient between source and target language. On the lowest-resource variety of Romansh, the resulting system surpasses Gemini 3 Pro by 23 BLEU points. A human evaluation confirms that it is the first model to generate fluent translations in the individual Romansh varieties.
Abstract
Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.