Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

📅 2025-07-17
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address challenges in multilingual translation—including complex language patterns and unnatural outputs—this paper proposes an optimization framework for 7B-parameter large language models. Methodologically, it integrates high-quality monolingual/bilingual mixed pretraining, chain-of-thought-guided instruction tuning, and reinforcement learning–based optimization of translation generalization across 28 languages. Crucially, it explicitly models reasoning steps within translation instructions and enhances cross-lingual consistency via a multilingual-aligned reward mechanism. Experimental results demonstrate significant improvements over comparable open-source models (e.g., Qwen2-7B, LLaMA3-8B) in both automatic metrics (BLEU, COMET) and human evaluations (fluency, faithfulness), achieving performance on par with Gemini-2.5 and GPT-4o. The model and training paradigm are fully open-sourced, establishing a new benchmark and practical pathway for efficient multilingual translation research.
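
The chain-of-thought instruction tuning described above trains the model to reason before translating. Below is a minimal sketch of how such a training sample might be formatted; the tag names and template are illustrative assumptions, not the actual Seed-X prompt format.

```python
# Sketch: formatting a CoT-style translation training example.
# The <think>/<translation> tags and the instruction wording are
# illustrative assumptions, not the actual Seed-X template.

def format_cot_example(src_lang: str, tgt_lang: str,
                       source: str, reasoning: str, translation: str) -> str:
    """Build one instruction-tuning sample: the model is trained to
    emit its reasoning steps before committing to a translation."""
    prompt = (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"Think through ambiguous words and idioms before answering.\n\n"
        f"{source}\n"
    )
    target = f"<think>{reasoning}</think>\n<translation>{translation}</translation>"
    return prompt + "\n" + target

sample = format_cot_example(
    "English", "German",
    "The spirit is willing, but the flesh is weak.",
    "This is an idiom; a literal word-for-word rendering would sound stilted. "
    "Use the established German equivalent.",
    "Der Geist ist willig, aber das Fleisch ist schwach.",
)
print(sample)
```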

📝 Abstract
Multilingual translation is a challenging task for large language models (LLMs), which must handle intricate language patterns and avoid the stilted phrasing that arises in automated translation. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then fine-tuned to translate via Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share best practices from our optimization process and make the parameters publicly available to advance translation research and applications.
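
Since the weights are open-sourced, inference should follow the standard Hugging Face transformers workflow. A minimal sketch, assuming a checkpoint ID like ByteDance-Seed/Seed-X-Instruct-7B (check the official release for the exact model name):

```python
# Sketch: translating with the open-sourced instruct model via
# Hugging Face transformers. The checkpoint ID is an assumption;
# consult the official Seed-X release for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-X-Instruct-7B"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = ("Translate the following English sentence into Chinese:\n"
          "May the force be with you.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```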
Problem

Research questions and friction points this paper is trying to address.

Addressing intricate language patterns in multilingual translation
Overcoming stilted, unnatural output from automated translation systems
Enhancing translation quality across 28 diverse languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

A 7B-parameter multilingual LLM dedicated to translation
Chain-of-Thought reasoning for the instruct model
Reinforcement learning to enhance generalization across language pairs (see the reward sketch below)
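
The reward mechanism is not spelled out on this page, but a common design is to score candidate translations with a learned quality-estimation model and feed those scores to a policy-gradient method. A minimal sketch, assuming the public reference-free CometKiwi scorer stands in for the paper's multilingual-aligned reward:

```python
# Sketch: a translation reward for RL fine-tuning, assuming a
# reference-free COMET-style quality-estimation model as the scorer.
# The paper's actual reward design may differ; this only illustrates
# the shape of a quality-based reward signal.
from comet import download_model, load_from_checkpoint

# CometKiwi is a public reference-free QE model (source + hypothesis only).
ckpt = download_model("Unbabel/wmt22-cometkiwi-da")
scorer = load_from_checkpoint(ckpt)

def translation_rewards(sources: list[str], hypotheses: list[str]) -> list[float]:
    """Score each (source, hypothesis) pair; higher means better.
    These scores would be fed to a policy-gradient method (e.g. PPO)
    as per-sample rewards for the translation policy."""
    data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
    return scorer.predict(data, batch_size=8, gpus=0).scores

rewards = translation_rewards(
    ["The weather is nice today."],
    ["Das Wetter ist heute schön."],
)
print(rewards)
```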
👥 Authors
Shanbo Cheng, ByteDance Seed
Yu Bao, ByteDance Seed
Qian Cao, ByteDance Seed
Luyang Huang, ByteDance
Liyan Kang, ByteDance
Zhicheng Liu, ByteDance Seed
Yu Lu, ByteDance Seed
Wenhao Zhu, ByteDance Seed
Zhichao Huang, ByteDance Seed
Tao Li, ByteDance Seed
Sitong Liu, Duke University
Ningxin Peng, ByteDance Research
Shuaijie She, National Key Laboratory for Novel Software Technology, Nanjing University
Lu Xu, RIKEN AIP
Nuo Xu, ByteDance Seed
Sen Yang, ByteDance Seed
Runsheng Yu, unknown affiliation
Yiming Yu, ByteDance Seed
Liehao Zou, ByteDance Seed
Hang Li, ByteDance Seed
Lu Lu, ByteDance Seed
Yuxuan Wang, ByteDance Seed
Yonghui Wu, ByteDance Seed