🤖 AI Summary
Current machine translation evaluation focuses primarily on segment-level accuracy and fluency and fails to adequately assess terminology consistency and cross-sentential coherence, both critical for domain-specific document-level translation. To address this gap, we introduce DiscoX, the first Chinese–English document-level translation benchmark oriented toward academic communication, comprising 200 long documents (avg. >1,700 tokens) across seven specialized domains. We further propose Metric-S, a reference-free, fine-grained automatic evaluation framework that jointly models accuracy, fluency, and appropriateness and achieves strong correlation with human judgments (ρ = 0.82). Empirical results show that state-of-the-art large language models still fall significantly short of human experts on DiscoX, confirming the benchmark's difficulty. This work establishes a reproducible, scalable, and domain-aware evaluation paradigm for high-quality professional translation research.
📝 Abstract
The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese–English translation. It comprises 200 professionally curated texts from 7 domains, with an average length exceeding 1,700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a substantial performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advances in LLM-based translation.