A comparison of translation performance between DeepL and Supertext

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of cross-sentence consistency in commercial machine translation (MT) systems when translating long documents. It introduces a document-level contextual evaluation paradigm to compare DeepL and Supertext on unsegmented source texts. Professional human evaluators assess translations holistically, with full-document context, across four language directions: English↔German and English↔French. The authors release an open-source evaluation dataset and assessment scripts. Results show that Supertext is preferred over DeepL in three of the four language directions, particularly on document-level consistency properties such as coreference resolution, terminology uniformity, and logical coherence, defects that are invisible to conventional segment-level evaluation. This work moves MT evaluation from the sentence level to the document level, establishing a benchmark and reproducible resources for developing and assessing context-aware translation systems.

📝 Abstract
As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.
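The abstract's central contrast is between aggregating evaluator preferences per segment and aggregating them per document. The following is a minimal illustrative sketch of that distinction, not the paper's released scripts; the judgment data, system names as labels, and the majority-vote rule for documents are all assumptions made for the example:

```python
# Illustrative only: aggregating hypothetical pairwise preference judgments
# at segment level vs. document level.
from collections import Counter

# Each judgment: (document_id, preferred_system); "tie" marks no preference.
judgments = [
    ("doc1", "Supertext"), ("doc1", "tie"), ("doc1", "Supertext"),
    ("doc2", "DeepL"), ("doc2", "Supertext"), ("doc2", "Supertext"),
    ("doc3", "tie"), ("doc3", "DeepL"), ("doc3", "DeepL"),
]

def segment_level(judgs):
    """Count preferences over all segments, ignoring document grouping."""
    return Counter(pref for _, pref in judgs)

def document_level(judgs):
    """Assign each document the label preferred in most of its segments."""
    per_doc = {}
    for doc, pref in judgs:
        per_doc.setdefault(doc, Counter())[pref] += 1
    winners = Counter()
    for counts in per_doc.values():
        top = counts.most_common()
        # A document counts as a tie if its two most frequent labels tie.
        if len(top) > 1 and top[0][1] == top[1][1]:
            winners["tie"] += 1
        else:
            winners[top[0][0]] += 1
    return winners

print(segment_level(judgments))   # Counter({'Supertext': 4, 'DeepL': 3, 'tie': 2})
print(document_level(judgments))  # Counter({'Supertext': 2, 'DeepL': 1})
```

The point of the contrast: a system that wins narrowly within many documents can look indistinguishable in pooled segment counts while clearly winning at the document level, which is the kind of effect the abstract reports.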
Problem

Research questions and friction points this paper is trying to address.

Compare translation performance between commercial MT systems
Assess translation quality on unsegmented texts
Develop more context-sensitive evaluation methodologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates unsegmented text translations
Compares DeepL and Supertext systems
Uses document-level context assessments
Alex Flückiger
Supertext
Chantal Amrhein
Statistical Office, Canton of Zurich
Natural Language Processing, Computational Linguistics
Tim Graf
Supertext
Philippe Schläpfer
Supertext
Florian Schottmann
Supertext
Natural Language Processing
Samuel Läubli
Supertext