TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited scale, poor quality, and narrow topical coverage of parallel data for low-resource machine translation (MT), this paper proposes TopXGen, a two-stage controllable data-synthesis framework. First, large language models (LLMs) generate diverse, natural, topic-rich texts directly in the low-resource target languages; second, these texts are back-translated into a high-resource source language to construct synthetic parallel corpora. Unlike conventional back-translation, TopXGen does not rely on pre-existing authentic target-side texts, establishing a "target-side generation → back-translation" paradigm instead. Experiments demonstrate that TopXGen significantly improves translation performance under both supervised fine-tuning and in-context learning (average BLEU gain of +2.8), while enhancing model generalization and topical coverage. This approach provides a scalable, high-fidelity data-augmentation pathway for MT in data-scarce scenarios.

📝 Abstract
LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
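The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (see the linked repository for that): the function names, prompt templates, and the `toy_llm` stand-in are all assumptions introduced here for clarity.

```python
# Hedged sketch of a TopXGen-style pipeline:
# stage 1 generates topic-seeded texts in the low-resource target language,
# stage 2 backtranslates them into a high-resource source language.

def generate_target_texts(llm, topics, target_lang, n_per_topic=2):
    """Stage 1: prompt an LLM to write natural target-side (LRL) texts,
    seeding each prompt with a topic to encourage topical diversity."""
    texts = []
    for topic in topics:
        for _ in range(n_per_topic):
            prompt = f"Write a short paragraph in {target_lang} about: {topic}"
            texts.append(llm(prompt))
    return texts

def backtranslate(llm, texts, source_lang):
    """Stage 2: translate each generated text into the high-resource
    source language, yielding synthetic (source, target) pairs."""
    pairs = []
    for text in texts:
        prompt = f"Translate into {source_lang}: {text}"
        pairs.append((llm(prompt), text))
    return pairs

# Toy stand-in for a real LLM call, so the sketch runs end to end.
def toy_llm(prompt):
    return f"<output for: {prompt[:40]}>"

topics = ["agriculture", "sports"]
target_texts = generate_target_texts(toy_llm, topics, "Swahili")
parallel = backtranslate(toy_llm, target_texts, "English")
print(len(parallel))  # 2 topics x 2 texts each = 4 pairs
```

The resulting (source, target) pairs can then feed supervised fine-tuning or serve as an example pool for in-context learning, as the paper evaluates.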
Problem

Research questions and friction points this paper is trying to address.

Generating topic-diverse parallel data for low-resource machine translation
Overcoming limitations of existing parallel datasets in size and quality
Enhancing LLM translation performance via synthetic data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based topic-diverse data generation
Backtranslation for low-resource languages
Enhances fine-tuning and in-context learning