🤖 AI Summary
This study addresses a gap in machine translation evaluation: existing metrics overlook whether a system adheres to the multidimensional constraints users specify in their instructions, such as formatting, terminology, and register. The authors introduce a benchmark for instruction following in translation across seven languages, covering both single- and multi-constraint samples. Evaluation combines deterministic checkers with rubric-based large language model scoring, fused under a multiplicative scoring rule designed to resist reward hacking. Experiments across 15 models show that instruction following scales with model size more sharply than translation quality does, that glossary (terminology) and structured-format constraints are the hardest to satisfy, and that general instruction-following rankings correlate only weakly with translation behavior.
📝 Abstract
Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction-following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, and the two are combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction-following rankings correlate only weakly with translation behavior. Our benchmark is available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
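The gating-plus-continuous scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function names (`same_json_shape`, `glossary_respected`, `fused_score`) are hypothetical, the gate is taken to be the conjunction of all deterministic checks, and the continuous part is taken to be the mean of judge scores in [0, 1]. The benchmark's actual checkers and fusion rule may differ.

```python
import json
from typing import Sequence


def _shape(node):
    """Reduce a parsed JSON value to its key/list structure, ignoring leaf values."""
    if isinstance(node, dict):
        return {key: _shape(value) for key, value in node.items()}
    if isinstance(node, list):
        return [_shape(value) for value in node]
    return None  # leaf text is expected to change under translation


def same_json_shape(source: str, translation: str) -> bool:
    """Hypothetical gate: the output must be valid JSON with the source's structure."""
    try:
        return _shape(json.loads(source)) == _shape(json.loads(translation))
    except json.JSONDecodeError:
        return False


def glossary_respected(translation: str, required_terms: Sequence[str]) -> bool:
    """Hypothetical gate: every required target-language term appears verbatim."""
    return all(term in translation for term in required_terms)


def fused_score(gates: Sequence[bool], judge_scores: Sequence[float]) -> float:
    """Assumed multiplicative fusion: any failed gate zeroes the whole score."""
    gate = 1.0 if all(gates) else 0.0
    # Assumption: continuous constraints are averaged; with none present, the gate decides.
    continuous = sum(judge_scores) / len(judge_scores) if judge_scores else 1.0
    return gate * continuous
```

Under these assumptions, an output that passes both gates with judge scores of 1.0 and 0.5 receives 0.75, while an output that breaks the JSON structure receives 0.0 no matter how highly the judge rates its fluency. That is the property a multiplicative rule offers over an additive one: a high continuous score cannot compensate for a failed hard constraint.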