🤖 AI Summary
To address the limited scale, uneven quality, and narrow topical coverage of parallel data for low-resource machine translation (MT), this paper proposes TopXGen, a two-stage data synthesis framework. First, a large language model (LLM) generates diverse, natural, topic-rich texts directly in the low-resource target languages; second, these texts are back-translated into a high-resource source language to build synthetic parallel corpora. Unlike conventional back-translation, TopXGen does not assume the existence of authentic target-side texts, since the target side is itself LLM-generated. Experiments show that TopXGen improves translation performance under both supervised fine-tuning and in-context learning (an average BLEU gain of +2.8), while broadening topical coverage. The approach offers a scalable data-augmentation pathway for MT in data-scarce settings.
📝 Abstract
LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
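The two stages described in the abstract can be sketched as a short pipeline. This is a minimal illustration, not the paper's implementation: the function names, prompts, and the `toy_llm` stand-in are all hypothetical placeholders, and a real run would call an actual LLM in place of `toy_llm`.

```python
# Hedged sketch of the TopXGen two-stage pipeline. All names here are
# illustrative placeholders, not the actual API from the paper's repo.

def generate_target_texts(llm, topics, target_lang, n_per_topic=2):
    """Stage 1: prompt an LLM to write topic-diverse texts directly in the LRL."""
    texts = []
    for topic in topics:
        prompt = f"Write a short paragraph in {target_lang} about: {topic}"
        for _ in range(n_per_topic):
            texts.append(llm(prompt))
    return texts

def back_translate(llm, texts, source_lang):
    """Stage 2: back-translate the synthetic LRL texts into the HRL source side,
    yielding (source, target) parallel pairs for ICL or fine-tuning."""
    return [(llm(f"Translate into {source_lang}: {t}"), t) for t in texts]

def toy_llm(prompt):
    # Deterministic stand-in so the sketch runs end-to-end without a model.
    return f"<output for: {prompt[:40]}>"

target_texts = generate_target_texts(toy_llm, ["health", "sports"], "Basaa")
pairs = back_translate(toy_llm, target_texts, "English")
```

Note the direction: generation happens on the low-resource *target* side (where LLMs produce fluent monolingual text), and translation happens into the high-resource *source* side (the direction LLMs handle well), which is the core intuition of the paper.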