Scaling Low-Resource MT via Synthetic Data Generation with LLMs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of parallel corpora in low-resource machine translation (MT). We propose a document-level synthetic data construction method that leverages large language models (LLMs), providing the first systematic validation of LLM-generated data for non-English-centric MT across 7 target languages and 147 language pairs. Our approach uses multi-hop pivoting to scale synthetic data generation and releases SynOPUS, a high-quality, open-source synthetic dataset. The methodology integrates LLM-based text generation, document-level contextual modeling, automated filtering, and human verification. Experimental results show that, despite inherent noise, the synthetic data consistently improves low-resource MT performance, outperforming the HPLT baseline across all evaluated language pairs; gains are confirmed by both automatic metrics (BLEU, chrF) and human evaluation.
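The pivoting idea mentioned above can be sketched in a few lines: given two English-centric parallel corpora (English–X and English–Y), new X–Y pairs are formed by matching segments that share the same English side. This is a minimal illustration of the general pivoting technique, not the paper's actual pipeline; all names and the toy data are assumptions.

```python
def pivot_pairs(en_to_x, en_to_y):
    """Pair X- and Y-side segments that share the same English pivot.

    en_to_x, en_to_y: dicts mapping an English source segment to its
    translation in language X or Y (illustrative stand-ins for
    document-aligned corpora).
    """
    shared = en_to_x.keys() & en_to_y.keys()
    return [(en_to_x[en], en_to_y[en]) for en in sorted(shared)]

# Toy example: two English-centric corpora pivoted into a German-French pair.
en_de = {"Good morning.": "Guten Morgen.", "Thank you.": "Danke."}
en_fr = {"Good morning.": "Bonjour.", "See you.": "À bientôt."}
print(pivot_pairs(en_de, en_fr))  # [('Guten Morgen.', 'Bonjour.')]
```

Chaining this step (e.g. pivoting again through the newly created pairs) is how a small set of English-centric corpora can be scaled out to many non-English-centric pairs, at the cost of accumulating noise at each hop.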

📝 Abstract
We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
Problem

Research questions and friction points this paper is trying to address.

Improving low-resource machine translation via LLM-generated synthetic data
Evaluating synthetic data quality and effectiveness for diverse languages
Expanding synthetic data utility beyond English-centric translation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated synthetic data for MT
Document-level corpus from Europarl
Public repository SynOPUS for datasets