🤖 AI Summary
This study addresses the lack of cross-sentence consistency in commercial machine translation (MT) systems when translating long documents. We propose a document-level contextual evaluation paradigm and use it to compare DeepL and Supertext on unsegmented source texts. Professional human evaluators assess translations holistically, with full-document context, across four language directions: English↔German and English↔French. We release the evaluation dataset and assessment scripts as open source. Results show a preference for Supertext in three of the four language directions, driven largely by document-level consistency phenomena such as coreference, terminology uniformity, and logical coherence, which conventional segment-level evaluation cannot capture. This work advances MT evaluation from the sentence level to the document level, providing reproducible resources for developing and assessing context-aware translation systems.
📝 Abstract
As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions, with professional translators assessing segments in full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.
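The core methodological contrast -- segment-level tallies versus a per-document aggregate -- can be sketched as follows. This is an illustrative sketch only: the record fields (`doc`, `direction`, `preference`) and the majority-vote aggregation are assumptions for the example, not the released scripts' actual format or statistics.

```python
from collections import Counter

# Hypothetical per-segment annotations: each record names a document, a
# language direction, and the evaluator's preferred system for that segment.
annotations = [
    {"doc": "d1", "direction": "en-de", "preference": "supertext"},
    {"doc": "d1", "direction": "en-de", "preference": "tie"},
    {"doc": "d1", "direction": "en-de", "preference": "supertext"},
    {"doc": "d2", "direction": "en-de", "preference": "deepl"},
    {"doc": "d2", "direction": "en-de", "preference": "supertext"},
]

def segment_level_counts(records):
    """Tally raw per-segment preferences, ignoring document boundaries."""
    return Counter(r["preference"] for r in records)

def document_level_preferences(records):
    """Reduce segment judgments to one majority preference per document."""
    by_doc = {}
    for r in records:
        by_doc.setdefault(r["doc"], []).append(r["preference"])
    doc_prefs = {}
    for doc, prefs in by_doc.items():
        counts = Counter(p for p in prefs if p != "tie")
        top = counts.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            # No non-tie votes, or the systems are exactly split.
            doc_prefs[doc] = "tie"
        else:
            doc_prefs[doc] = top[0][0]
    return doc_prefs

print(segment_level_counts(annotations))
print(document_level_preferences(annotations))
```

In this toy data, segment-level counts favor neither system decisively on `d2`, while the document-level view collapses each document to a single verdict -- the kind of aggregation under which longer-range consistency differences become visible.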