🤖 AI Summary
To address the significant performance gap of large language models (LLMs) on direct non-English translation (x2x) relative to English-centric translation tasks, this paper proposes an English-anchored multilingual translation paradigm. It leverages high-quality English–Chinese bilingual data as a pivot to synthesize high-fidelity multilingual parallel corpora and introduces an English-reference-based automatic quality evaluation agent to jointly optimize and transfer x2x translation capability. By integrating synthetic data generation, multilingual parallel corpus expansion, and preference learning, the approach is the first to systematically generalize LLMs’ English-centric translation competence to non-English directions. Experiments demonstrate substantial improvements across 72 x2x translation directions on mainstream LLMs, while also boosting English-to-X and X-to-English performance. All generated data and fine-tuned models are publicly released.
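The pivot step described above can be made concrete with a short sketch. The code below is illustrative only: it assumes a generic en2x translator and an English-referenced quality scorer passed in as callables, and the names (`synthesize_x2x`, `translate`, `score`) and the filtering threshold are assumptions for exposition, not the paper's actual implementation.

```python
from typing import Callable, Iterable

TranslateFn = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> translation
ScoreFn = Callable[[str, str], float]         # (candidate, english_reference) -> quality in [0, 1]

def synthesize_x2x(
    pairs: Iterable[tuple[str, str]],  # existing (x_sentence, english_pivot) parallel data
    tgt_lang: str,                     # non-English target language y
    translate: TranslateFn,            # any strong en2x translator (e.g., the LLM itself)
    score: ScoreFn,                    # English-referenced quality proxy
    threshold: float = 0.8,            # keep only high-fidelity synthetic pairs (assumed cutoff)
) -> list[tuple[str, str]]:
    """Extend an x-en parallel corpus into synthetic x-y pairs by pivoting through English."""
    kept = []
    for x_sent, en_ref in pairs:
        # Exploit the model's established en2x strength to generate the y side.
        y_sent = translate(en_ref, "en", tgt_lang)
        # Judge the y-side output against the English reference, since reliable
        # direct x2x quality metrics are scarce.
        if score(y_sent, en_ref) >= threshold:
            kept.append((x_sent, y_sent))  # new x2x training pair
    return kept
```

Anchoring the filter to the English reference is the key design choice: the English side of the original corpus grounds the quality judgment even though it never appears in the final x–y pair.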
📝 Abstract
Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models' established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvements across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release code, datasets, and model checkpoints at https://github.com/NJUNLP/EAX.
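To illustrate the preference-based optimization step, here is a minimal sketch of turning several scored translation candidates into one (chosen, rejected) pair of the kind used in DPO-style training. The abstract only states that preference-based optimization is used, so the pairing heuristic, function names, and record format below are assumptions.

```python
from typing import Callable, Optional

ScoreFn = Callable[[str, str], float]  # (candidate, english_reference) -> quality score

def build_preference_pair(
    prompt: str,            # x-language source plus the translation instruction
    candidates: list[str],  # multiple sampled y-language translations
    en_ref: str,            # English reference used as the scoring anchor
    score: ScoreFn,
) -> Optional[dict]:
    """Rank candidates by English-referenced quality and pair the best
    against the worst for preference optimization (e.g., DPO)."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: score(c, en_ref), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen == rejected:
        return None  # no usable contrast between candidates
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Ranking by the English-referenced score keeps the preference signal anchored to the direction where the model is already strong, which is the bootstrapping idea the abstract describes.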