DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

📅 2025-11-14
🤖 AI Summary
Current machine translation evaluation focuses mainly on segment-level accuracy and fluency, so it does not adequately assess terminology consistency or cross-sentence coherence, both of which are critical for domain-specific document-level translation. To address this gap, we introduce DiscoX, the first academic communication-oriented Chinese–English document-level translation benchmark, comprising 200 long documents (averaging more than 1,700 tokens each) across seven specialized domains. We further propose Metric-S, a reference-free, fine-grained automatic evaluation framework that jointly scores accuracy, fluency, and appropriateness and correlates strongly with human judgments (ρ = 0.82). Experiments show that state-of-the-art large language models still fall well short of human experts on DiscoX, which confirms the benchmark's difficulty. Together, the benchmark and metric offer a reproducible, scalable, and domain-aware way to evaluate professional-grade translation.
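
To make the reported agreement figure concrete, here is a minimal sketch of how a correlation between automatic scores and human ratings could be computed. It assumes ρ denotes Spearman's rank correlation, which the summary does not state, and the score lists are made-up illustrative values, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-document scores (illustrative only).
metric_scores = [62.0, 71.5, 55.0, 80.0, 68.0]
human_scores = [70, 75, 58, 82, 65]
rho = spearman_rho(metric_scores, human_scores)
```

A value near 1 means the metric orders documents almost the same way the human raters do, which is the sense in which a figure such as 0.82 indicates strong agreement.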

📝 Abstract
The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating discourse-level translation in expert domains remains inadequate
Current methods do not assess discourse-level coherence or terminological precision
Advanced LLMs still significantly trail human expert performance
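
As an illustration of why segment-level scoring can miss terminology problems, the sketch below counts how one source term is rendered across a document. It is a toy check under stated assumptions, not part of Metric-S: the function names, the substring matching, the candidate renderings, and the example sentences are all hypothetical.

```python
from collections import Counter


def term_renderings(sentence_pairs, source_term, candidate_renderings):
    """Count which English rendering accompanies each occurrence of source_term.

    sentence_pairs: list of (source_sentence, translated_sentence) tuples.
    Matching is naive case-insensitive substring search (an assumption).
    """
    counts = Counter()
    for src, tgt in sentence_pairs:
        if source_term not in src:
            continue
        tgt_lower = tgt.lower()
        matched = [r for r in candidate_renderings if r.lower() in tgt_lower]
        counts.update(matched or ["<unmatched>"])
    return counts


def is_consistent(counts):
    """A term is consistent if at most one rendering was used for it."""
    return len(counts) <= 1


# Hypothetical document: each sentence is acceptable on its own,
# but the same term is translated two different ways.
pairs = [
    ("注意力机制是核心组件。", "The attention mechanism is the core component."),
    ("该注意力机制降低了计算成本。", "This attentional scheme reduces computational cost."),
    ("实验在三个数据集上进行。", "Experiments were run on three datasets."),
]
counts = term_renderings(
    pairs, "注意力机制", ["attention mechanism", "attentional scheme"]
)
consistent = is_consistent(counts)
```

Here every sentence would pass a segment-level check, yet the document as a whole uses two renderings for one term, which is the kind of error a discourse-level evaluation is meant to surface.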
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiscoX benchmark for discourse-level expert translation
Metric-S, a reference-free, fine-grained automatic evaluation system
Metric-S agrees strongly with human judgments and outperforms existing metrics
👥 Authors
Xiying Zhao (ByteDance Seed, Peking University)
Zhoufutu Wen (ByteDance SEED). Interests: LLM Evaluation
Zhixuan Chen (ByteDance Seed, Peking University)
Jingzhe Ding (ByteDance Seed, Peking University)
Jianpeng Jiao (ByteDance Seed, Peking University)
Shuai Li (ByteDance Seed, Peking University)
Xi Li (ByteDance Seed, Peking University)
Danni Liang (ByteDance Seed, Peking University)
Shengda Long (ByteDance Seed, Peking University)
Qianqian Liu (ByteDance Seed, Peking University)
Xianbo Wu (ByteDance Seed, Peking University)
Hongwan Gao (ByteDance Seed, Peking University)
Xiang Gao (ByteDance Seed, Peking University)
Liang Hu (ByteDance Seed, Peking University)
Jiashuo Liu (Tsinghua University). Interests: Robust Optimization, OOD Generalization, Data-Centric AI
Mengyun Liu (ByteDance Seed, Peking University)
Weiran Shi (ByteDance Seed, Peking University)
Chenghao Yang (University of Chicago). Interests: Human-AI Alignment, NLP, ML, Communication & Intelligence
Qianyu Yang (ByteDance Seed, Peking University)
Xuanliang Zhang (Harbin Institute of Technology). Interests: Natural Language Processing, Semantic Parsing, Table Reasoning
Ge Zhang (ByteDance Seed, Peking University)
Wenhao Huang (ByteDance Seed, Peking University)