Scaling Low-Resource MT via Synthetic Data Generation with LLMs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of parallel corpora in low-resource machine translation (MT). We propose a document-level synthetic data construction method that leverages large language models (LLMs), providing the first systematic validation of LLM-generated data for non-English-centric MT across 7 target languages and 147 language pairs. Our approach uses multi-hop pivoting to scale synthetic data generation and releases SynOPUS, a high-quality, open-source synthetic dataset. The methodology integrates LLM-based text generation, document-level contextual modeling, automated filtering, and human verification. Experimental results show that, despite inherent noise, the synthetic data consistently improves low-resource MT performance, outperforming the HPLT baseline across all evaluated language pairs; gains are confirmed by both automatic metrics (BLEU, chrF) and human evaluation.
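The pivoting idea mentioned above can be sketched in a few lines: given two English-centric parallel corpora (English–X and English–Y), new X–Y pairs are formed by matching segments that share the same English side. This is a minimal illustration of the general pivoting technique, not the paper's actual pipeline; all names and the toy data are assumptions.

```python
def pivot_pairs(en_to_x, en_to_y):
    """Pair X- and Y-side segments that share the same English pivot.

    en_to_x, en_to_y: dicts mapping an English source segment to its
    translation in language X or Y (illustrative stand-ins for
    document-aligned corpora).
    """
    shared = en_to_x.keys() & en_to_y.keys()
    return [(en_to_x[en], en_to_y[en]) for en in sorted(shared)]

# Toy example: two English-centric corpora pivoted into a German-French pair.
en_de = {"Good morning.": "Guten Morgen.", "Thank you.": "Danke."}
en_fr = {"Good morning.": "Bonjour.", "See you.": "À bientôt."}
print(pivot_pairs(en_de, en_fr))  # [('Guten Morgen.', 'Bonjour.')]
```

Chaining this step (e.g. pivoting again through the newly created pairs) is how a small set of English-centric corpora can be scaled out to many non-English-centric pairs, at the cost of accumulating noise at each hop.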

📝 Abstract
We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
Problem

Research questions and friction points this paper is trying to address.

Improving low-resource machine translation via LLM-generated synthetic data
Evaluating synthetic data quality and effectiveness for diverse languages
Expanding synthetic data utility beyond English-centric translation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated synthetic data for MT
Document-level corpus from Europarl
Public repository SynOPUS for datasets