🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in document-level machine translation, where scarce high-quality parallel corpora often lead to hallucinations and content omissions, resulting in performance inferior to traditional encoder-decoder architectures. To mitigate these issues, the authors propose a novel framework combining synthetic data filtering with a two-stage fine-tuning strategy. First, they transform summarization data into document-level parallel corpora and apply a multi-metric filtering approach based on sacreBLEU, COMET, and LaBSE cosine similarity. Subsequently, they perform progressive fine-tuning, initially at the sentence level and then at the document level, to enhance model reliability and contextual coherence. This approach effectively alleviates data scarcity and generation instability, significantly improving translation quality and discourse-level consistency.
📝 Abstract
In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment the data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we fine-tune in two stages: first on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
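The multi-metric filtering step described above can be sketched as a simple conjunctive gate: a synthetic document pair is kept only if it passes all three quality checks. The sketch below is illustrative, not the authors' implementation; the threshold values, the `keep_pair` helper, and the pure-Python cosine similarity are assumptions (in practice the BLEU and COMET scores would come from the sacreBLEU and COMET libraries, and the embeddings from a LaBSE encoder).

```python
import math

# Hypothetical cutoffs for illustration only; the paper does not state its exact thresholds.
BLEU_MIN = 20.0       # sacreBLEU score (0-100 scale)
COMET_MIN = 0.75      # COMET quality estimate
LABSE_COS_MIN = 0.80  # cosine similarity between LaBSE sentence embeddings

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def keep_pair(bleu_score, comet_score, src_embedding, tgt_embedding):
    """Keep a synthetic document pair only if every metric clears its threshold."""
    return (
        bleu_score >= BLEU_MIN
        and comet_score >= COMET_MIN
        and cosine_similarity(src_embedding, tgt_embedding) >= LABSE_COS_MIN
    )

# A pair with strong scores on all three metrics is retained;
# a low BLEU score alone is enough to discard it.
print(keep_pair(32.5, 0.82, [1.0, 0.0], [0.9, 0.1]))  # True
print(keep_pair(12.0, 0.82, [1.0, 0.0], [0.9, 0.1]))  # False
```

Gating on the conjunction of a surface-overlap metric (sacreBLEU), a learned quality metric (COMET), and a semantic-similarity metric (LaBSE cosine) means each metric can veto pairs the others miss, which is what makes the multi-metric design stricter than any single filter.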