TransBench: Benchmarking Machine Translation for Industrial-Scale Applications

📅 2025-05-20
🤖 AI Summary
Existing machine translation (MT) evaluation benchmarks lack industrial relevance: they do not adequately assess critical capabilities such as domain-specific terminology handling, cultural adaptation, and stylistic consistency. To address this gap, we introduce TransBench, the first open-source, industry-grade MT benchmark tailored for international e-commerce. It covers four realistic deployment scenarios, 33 language pairs, and 17,000 professionally translated sentences. We propose a three-level evaluation framework: “Linguistic Foundation → Domain Expertise → Cultural Adaptation.” We also introduce Marco-MOS, a domain-specific evaluation model used alongside traditional automatic metrics (BLEU, TER). We release a reproducible benchmark construction methodology and an automated evaluation toolchain. Experiments reveal that state-of-the-art LLMs underperform professional human translators by 23.6% on average in cultural adaptation. TransBench has been adopted by multiple industry partners for MT system selection and optimization, bridging the gap between academic evaluation and real-world performance.

📝 Abstract
Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.

Problem

Research questions and friction points this paper is trying to address.

Evaluating industrial MT lacks domain-specific benchmarks
Existing frameworks miss cultural and stylistic nuances
No holistic metrics for multi-level translation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-level framework for industrial MT evaluation
TransBench benchmark with 17,000 e-commerce sentences
Evaluation that combines traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model
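
For readers unfamiliar with the traditional metrics named above, the following is a minimal Python sketch of the ideas behind them: clipped n-gram precision (the building block of BLEU) and a simplified TER computed as word-level edit distance divided by reference length. It is an illustration only, not the TransBench evaluation toolchain. The function names are hypothetical, tokenization is plain whitespace splitting, the TER variant omits the block-shift operation of full TER, and Marco-MOS is not modeled because the summary does not describe its internals. The example sentences are made up.

```python
from collections import Counter


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Simplified TER: edits / reference length (no block shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


def ngram_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref_ngrams[gram])
                  for gram, count in hyp_ngrams.items())
    return clipped / total


# Made-up e-commerce style example (not taken from TransBench).
hypothesis = "wireless bluetooth earphones"
reference = "wireless bluetooth headphones"
ter_score = simple_ter(hypothesis, reference)
unigram_p = ngram_precision(hypothesis, reference, n=1)
bigram_p = ngram_precision(hypothesis, reference, n=2)
```

In this example a single substituted word out of three reference words gives a simplified TER of 1/3 and a unigram precision of 2/3. Surface-overlap scores of this kind cannot tell whether "earphones" is an acceptable product term in the target market, which is the sort of gap the summary says a domain-specific model is meant to cover.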

Authors

Haijun Li
Washington State University
Probability, Risk Theory, Multivariate Extremes
Tianqi Shi
Alibaba International Digital Commerce
Zifu Shang
Alibaba International Digital Commerce
Yuxuan Han
Tsinghua University
computer vision, computer graphics
Xueyu Zhao
Alibaba International Digital Commerce
Hao Wang
Alibaba International Digital Commerce
Yu Qian
Alibaba International Digital Commerce
Zhiqiang Qian
Alibaba International Digital Commerce
Linlong Xu
Alibaba International Digital Commerce
Minghao Wu
Alibaba International Digital Commerce
Chenyang Lyu
Alibaba
Large Language Models, Natural Language Processing, Machine Learning
Longyue Wang
Alibaba International
Large Language Model, Machine Translation, Natural Language Processing, Language Agent
Gongbo Tang
Beijing Language and Culture University
Weihua Luo
Alibaba
natural language processing, machine learning, artificial intelligence
Zhao Xu
Alibaba International Digital Commerce
Kaifu Zhang
Assistant Professor of Marketing, Carnegie Mellon University
Two-sided markets, Internet platforms, e-commerce