EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance gap of large language models (LLMs) on direct non-English multilingual translation (x2x) compared to English-centric translation tasks, this paper proposes an English-anchored multilingual translation paradigm. It leverages high-quality English–Chinese bilingual data as a pivot to synthesize high-fidelity multilingual dialogue corpora and introduces an English-reference-based automatic quality evaluation agent to jointly optimize and transfer x2x translation capabilities. Integrating synthetic data generation, multilingual parallel corpus expansion, and preference learning, our approach is the first to systematically generalize LLMs’ English-centric translation competence to non-English directions. Experiments demonstrate substantial improvements across 72 x2x translation directions on mainstream LLMs, while also boosting English-to-X and X-to-English performance. All generated data and fine-tuned models are publicly released.

📝 Abstract
Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models' established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvements across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release code, datasets, and model checkpoints at https://github.com/NJUNLP/EAX.
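The abstract's core data-synthesis idea, expanding English-anchored bilingual data into all non-English directions by pivoting through en2x translation, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `translate` stands in for an LLM's en2x translation call (here mocked with fixed outputs so the snippet runs), and all names are assumptions.

```python
def translate(text: str, tgt_lang: str) -> str:
    # Stand-in for an LLM en2x translation call; mocked for illustration.
    mock = {
        ("A cat sleeps.", "de"): "Eine Katze schläft.",
        ("A cat sleeps.", "fr"): "Un chat dort.",
    }
    return mock[(text, tgt_lang)]

def expand_to_omnidirectional(en_zh_pairs, target_langs):
    """Extend English-anchored (en, zh) pairs into x2x training samples.

    The English side is translated into each target language, then every
    ordered pair of non-English versions becomes a candidate x2x sample.
    The English sentence is kept as a reference anchor, enabling the
    English-referenced quality evaluation described in the paper.
    """
    samples = []
    for en, zh in en_zh_pairs:
        versions = {"zh": zh}
        for lang in target_langs:
            versions[lang] = translate(en, lang)
        langs = list(versions)
        for src in langs:
            for tgt in langs:
                if src != tgt:
                    samples.append({
                        "src_lang": src, "tgt_lang": tgt,
                        "src": versions[src], "tgt": versions[tgt],
                        "en_ref": en,  # anchor for later quality scoring
                    })
    return samples

data = expand_to_omnidirectional([("A cat sleeps.", "一只猫在睡觉。")], ["de", "fr"])
print(len(data))  # 3 language versions -> 6 directed x2x pairs
```

With one English–Chinese pair and two extra target languages, three parallel versions yield six directed x2x samples, which is how a pivoted corpus grows quadratically in the number of covered languages.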
Problem

Research questions and friction points this paper is trying to address.

Improving non-English translation performance in large language models
Generating synthetic training data from English parallel corpora
Enhancing multilingual translation through English-centric optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends English parallel corpora into omnidirectional datasets
Develops English-referenced quality evaluation proxy
Combines synthetic data with preference-based optimization
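To make the second and third bullets concrete, here is a hedged sketch of how an English-referenced quality proxy could rank x2x candidates and yield chosen/rejected pairs for preference-based optimization (e.g. DPO-style training). The token-overlap scorer is a deliberately simple stand-in for the paper's LLM-based evaluation agent, and all function names are illustrative assumptions.

```python
def english_ref_score(candidate_en_gloss: str, en_ref: str) -> float:
    """Toy proxy: token overlap between a candidate translation's English
    gloss and the English reference. A real system would query an LLM judge
    against the anchor sentence instead."""
    cand = set(candidate_en_gloss.lower().split())
    ref = set(en_ref.lower().split())
    return len(cand & ref) / max(len(ref), 1)

def build_preference_pair(candidates, en_ref):
    """candidates: list of (translation, english_gloss) for one x2x input.
    Returns (chosen, rejected) translations for preference learning."""
    ranked = sorted(
        candidates,
        key=lambda c: english_ref_score(c[1], en_ref),
        reverse=True,
    )
    return ranked[0][0], ranked[-1][0]

cands = [
    ("Eine Katze schläft.", "a cat sleeps"),
    ("Ein Hund bellt.", "a dog barks"),
]
chosen, rejected = build_preference_pair(cands, "A cat sleeps.")
print(chosen)  # the candidate closest to the English anchor
```

The design point this illustrates: because the English anchor survives the synthesis step, candidate quality in any non-English direction can be judged against a language where the model (and evaluation tooling) is strongest, sidestepping the lack of reliable x2x references.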
Sen Yang
National Key Laboratory for Novel Software Technology, Nanjing University
Yu Bao
ByteDance Research
Yu Lu
ByteDance Research
Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University
Shujian Huang
School of Computer Science, Nanjing University
Natural Language Processing · Machine Translation · Multilingualism · Large Language Models
Shanbo Cheng
ByteDance Seed
LLMs · MLNLP · Machine Translation · Multimodal