HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited discriminative power of existing general-purpose Chinese–English machine translation benchmarks in knowledge-intensive domains such as finance, healthcare, and law, where system scores often saturate. To overcome this, we introduce HardMTBench, a bidirectional, difficulty-aware diagnostic benchmark spanning 12 knowledge-intensive domains. A three-stage pipeline integrates multidimensional signals (knowledge density, terminological load, translation difficulty, and reference correctness) and combines domain-quota sampling with large-scale human curation to construct a high-discriminability test set of 20,000 directional items built from 10,000 source sentences. Experiments across 22 state-of-the-art systems show that HardMTBench roughly doubles the cross-system GEMBA score range relative to FLORES-200, substantially improving the benchmark's discriminability and diagnostic value. The benchmark exposes model weaknesses in handling domain-specific terminology and knowledge, and it produces visible reorderings of system rankings.
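The notion of discriminability used here can be read as the spread of scores across systems. The following is a minimal sketch of how such a spread might be measured, assuming one aggregate score per system. The function name `score_spread`, the system names, and all score values are hypothetical and are not taken from the paper; the use of the population standard deviation is also an assumption, since the source does not say which variant it reports.

```python
from statistics import pstdev


def score_spread(scores):
    """Return (range, population standard deviation) of per-system scores."""
    values = list(scores.values())
    return max(values) - min(values), pstdev(values)


# Hypothetical toy scores for illustration only (not results from the paper).
saturated = {"sys_a": 94.0, "sys_b": 95.0, "sys_c": 96.0}
harder = {"sys_a": 80.0, "sys_b": 84.0, "sys_c": 88.0}

sat_range, sat_std = score_spread(saturated)
hard_range, hard_std = score_spread(harder)

# Ratio of the two ranges: how much the harder set widens the spread.
widening = hard_range / sat_range
print(sat_range, hard_range, widening)
```

A larger range or standard deviation on the same set of systems means the benchmark separates them more clearly, which is the property the summary describes.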

📝 Abstract

General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84,566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at https://github.com/jasonNLP/HardMTBench.
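The final assembly step (fusing judge signals into a hardness score, then filling per-domain quotas) might look roughly like the sketch below. The abstract does not specify the fusion rule, so the weighted mean in `fuse_hardness` is an assumption, as are the field names (`signals`, `reference_ok`, `domain`), the signal keys, the equal weights, the quota value, and the toy candidates. It is an illustration of the general idea, not the authors' implementation.

```python
from collections import defaultdict


def fuse_hardness(signals, weights):
    """Weighted mean of judge signals (assumed fusion rule, not the paper's)."""
    total = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total


def select_with_quotas(candidates, weights, quota):
    """Keep the `quota` hardest candidates per domain, dropping bad references."""
    by_domain = defaultdict(list)
    for cand in candidates:
        if cand["reference_ok"]:  # filter on the reference-correctness signal
            by_domain[cand["domain"]].append(cand)

    selected = []
    for domain in sorted(by_domain):
        ranked = sorted(
            by_domain[domain],
            key=lambda c: fuse_hardness(c["signals"], weights),
            reverse=True,
        )
        selected.extend(ranked[:quota])
    return selected


# Hypothetical toy pool; all values are invented for illustration.
weights = {"knowledge": 1.0, "difficulty": 1.0, "terminology": 1.0}
pool = [
    {"id": 1, "domain": "finance", "reference_ok": True,
     "signals": {"knowledge": 0.9, "difficulty": 0.8, "terminology": 0.7}},
    {"id": 2, "domain": "finance", "reference_ok": True,
     "signals": {"knowledge": 0.2, "difficulty": 0.3, "terminology": 0.1}},
    {"id": 3, "domain": "finance", "reference_ok": False,
     "signals": {"knowledge": 1.0, "difficulty": 1.0, "terminology": 1.0}},
    {"id": 4, "domain": "law", "reference_ok": True,
     "signals": {"knowledge": 0.5, "difficulty": 0.6, "terminology": 0.4}},
    {"id": 5, "domain": "law", "reference_ok": True,
     "signals": {"knowledge": 0.6, "difficulty": 0.9, "terminology": 0.9}},
]

selected_ids = [c["id"] for c in select_with_quotas(pool, weights, quota=1)]
print(selected_ids)
```

In this toy pool, candidate 3 has the highest fused score but is excluded because its reference is flagged as incorrect, which shows why reference correctness is treated here as a filter and not as a hardness signal. Whether the paper handles it the same way is not stated in the abstract.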

Problem

Research questions and friction points this paper is trying to address.

machine translation

Chinese-English translation

knowledge-intensive domains

benchmark saturation

translation evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

HardMTBench

knowledge-intensive translation

difficulty-aware benchmark