🤖 AI Summary
This study addresses the lack of cross-sentence consistency in commercial machine translation (MT) systems when translating long documents. We propose a document-level contextual evaluation paradigm and use it to compare DeepL and Supertext on unsegmented source texts. Professional human evaluators assess translations holistically, with full-document context, across four language directions: English↔German and English↔French. We release the evaluation dataset and assessment scripts as open source. Results show a preference for Supertext in three of the four language directions, driven largely by document-level consistency phenomena such as coreference, terminology uniformity, and logical coherence, which conventional segment-level evaluation cannot capture. This work advances MT evaluation from the sentence level to the document level, providing reproducible resources for developing and assessing context-aware translation systems.
📝 Abstract
As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions, with professional translators assessing segments in full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.
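The core methodological contrast -- segment-level tallies versus a per-document aggregate -- can be sketched as follows. This is an illustrative sketch only: the record fields (`doc`, `direction`, `preference`) and the majority-vote aggregation are assumptions for the example, not the released scripts' actual format or statistics.

```python
from collections import Counter

# Hypothetical per-segment annotations: each record names a document, a
# language direction, and the evaluator's preferred system for that segment.
annotations = [
    {"doc": "d1", "direction": "en-de", "preference": "supertext"},
    {"doc": "d1", "direction": "en-de", "preference": "tie"},
    {"doc": "d1", "direction": "en-de", "preference": "supertext"},
    {"doc": "d2", "direction": "en-de", "preference": "deepl"},
    {"doc": "d2", "direction": "en-de", "preference": "supertext"},
]

def segment_level_counts(records):
    """Tally raw per-segment preferences, ignoring document boundaries."""
    return Counter(r["preference"] for r in records)

def document_level_preferences(records):
    """Reduce segment judgments to one majority preference per document."""
    by_doc = {}
    for r in records:
        by_doc.setdefault(r["doc"], []).append(r["preference"])
    doc_prefs = {}
    for doc, prefs in by_doc.items():
        counts = Counter(p for p in prefs if p != "tie")
        top = counts.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            # No non-tie votes, or the systems are exactly split.
            doc_prefs[doc] = "tie"
        else:
            doc_prefs[doc] = top[0][0]
    return doc_prefs

print(segment_level_counts(annotations))
print(document_level_preferences(annotations))
```

In this toy data, segment-level counts favor neither system decisively on `d2`, while the document-level view collapses each document to a single verdict -- the kind of aggregation under which longer-range consistency differences become visible.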