Extending Automatic Machine Translation Evaluation to Book-Length Documents

📅 2025-09-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Automatic machine translation evaluation is currently constrained by sentence-level datasets, fixed sentence-boundary assumptions, and token-length limits in metrics, which makes it inadequate for book-length documents. The paper proposes SEGALE, a document-level evaluation scheme that needs no predefined sentence boundaries and accounts for under- and over-translation. SEGALE extends existing automatic metrics to arbitrarily long texts by treating a document as continuous text and applying sentence segmentation and alignment. Experiments show that SEGALE significantly outperforms existing long-form document evaluation schemes and is comparable to evaluation with ground-truth sentence alignments. Applied to book-length texts, it also shows that many open-weight LLMs fail to translate documents effectively at their reported maximum context lengths.

📝 Abstract

Although Large Language Models (LLMs) demonstrate superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment because of dataset limitations, token-length restrictions in metrics, and rigid sentence-boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show that our scheme significantly outperforms existing long-form document evaluation schemes while being comparable to evaluations performed with ground-truth sentence alignments. Additionally, we apply our scheme to book-length texts and show for the first time that many open-weight LLMs fail to translate documents effectively at their reported maximum context lengths.

Problem

Research questions and friction points this paper is trying to address.

Extending machine translation evaluation beyond sentence-level to book-length documents

Overcoming dataset limitations and token restrictions in current evaluation metrics

Handling varied sentence boundaries and under-/over-translations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends metrics to long documents via continuous text processing

Uses sentence segmentation and alignment for document evaluation

Handles arbitrary length translations with boundary flexibility
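
The segment-then-align idea above can be illustrated with a small Python sketch. This is not the authors' implementation: the regex splitter, the word-overlap similarity (`jaccard`), the gap penalty, and the choice to score unaligned sentences as 0 are all assumptions made for illustration. A real pipeline would plug in a proper sentence segmenter, a cross-lingual similarity, and an existing sentence-level metric.

```python
import re
from typing import Callable, List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a sentence-level metric: word-set overlap in [0, 1]."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align(ref: List[str], hyp: List[str],
          sim: Callable[[str, str], float] = jaccard,
          gap: float = 0.2) -> List[Pair]:
    """Monotonic dynamic-programming alignment.

    Returns (ref_index, hyp_index) pairs; None on one side marks a sentence
    with no counterpart (a possible under- or over-translation).
    """
    n, m = len(ref), len(hyp)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] - gap, "skip_ref"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] - gap, "skip_hyp"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j], back[i][j] = max(
                (score[i - 1][j - 1] + sim(ref[i - 1], hyp[j - 1]), "match"),
                (score[i - 1][j] - gap, "skip_ref"),
                (score[i][j - 1] - gap, "skip_hyp"),
            )
    # Walk back from the bottom-right cell to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_ref":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def document_score(ref_text: str, hyp_text: str,
                   metric: Callable[[str, str], float] = jaccard) -> float:
    """Average the metric over aligned pairs; unaligned sentences score 0 (assumption)."""
    ref, hyp = split_sentences(ref_text), split_sentences(hyp_text)
    pairs = align(ref, hyp)
    if not pairs:
        return 0.0
    total = sum(metric(ref[i], hyp[j])
                for i, j in pairs if i is not None and j is not None)
    return total / len(pairs)


reference = "The cat sat on the mat. It was raining outside. The dog barked loudly."
hypothesis = "The cat sat on the mat. The dog barked loudly."
pairs = align(split_sentences(reference), split_sentences(hypothesis))
score = document_score(reference, hypothesis)
```

In this toy example the hypothesis drops the middle sentence, so the alignment leaves reference sentence 1 unpaired and the document score falls to 2/3 instead of being computed only over the sentences that happen to be present.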
