Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in document-level machine translation, where the scarcity of high-quality parallel corpora often leads to hallucinations and content omissions, leaving LLMs inferior to traditional encoder-decoder architectures. To mitigate these issues, the authors propose a framework combining synthetic data filtering with a two-stage fine-tuning strategy. First, they transform summarization data into document-level parallel corpora and apply multi-metric filtering based on sacreBLEU, COMET, and LaBSE cosine similarity. They then perform progressive fine-tuning, initially at the sentence level and subsequently at the document level, to enhance model reliability and contextual coherence. This approach alleviates data scarcity and generation instability, improving both translation quality and discourse-level consistency.
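The multi-metric filtering step described above can be sketched as a simple gate: a synthetic document pair survives only if every metric clears its threshold. The sketch below uses hypothetical scorer callables as stand-ins for the real sacreBLEU, COMET, and LaBSE scorers (which require trained models and third-party packages); the function names, the toy embeddings, and the threshold values are illustrative assumptions, not the paper's actual configuration.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (LaBSE-style scoring)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_pairs(pairs, scorers, thresholds):
    """Keep a (source, target) pair only if every metric clears its threshold.

    pairs:      list of (src_doc, tgt_doc) synthetic document pairs
    scorers:    dict name -> callable(src, tgt) -> float; hypothetical
                stand-ins for sacreBLEU / COMET / LaBSE scorers
    thresholds: dict name -> minimum acceptable score for that metric
    """
    kept = []
    for src, tgt in pairs:
        if all(scorers[name](src, tgt) >= thresholds[name] for name in scorers):
            kept.append((src, tgt))
    return kept

# Toy usage with a single embedding-similarity scorer over fake 2-d embeddings.
toy_embed = {
    "hello world": [1.0, 0.0],
    "hallo welt": [0.9, 0.1],
    "unrelated": [0.0, 1.0],
}
scorers = {"labse": lambda s, t: cosine_similarity(toy_embed[s], toy_embed[t])}
pairs = [("hello world", "hallo welt"), ("hello world", "unrelated")]
kept = filter_pairs(pairs, scorers, {"labse": 0.8})
```

In practice each scorer would wrap a real metric library, and the conjunction over metrics (rather than an averaged score) is one plausible reading of "multi-metric filtering".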

📝 Abstract
In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation, where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. We then employ a two-stage fine-tuning strategy: first fine-tuning on abundant sentence-level MT resources, and then on the filtered document-level corpus.
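The two-stage schedule in the abstract can be sketched as a simple curriculum: run updates over the sentence-level corpus first, then continue from those weights on the filtered document-level corpus. Everything here is a minimal illustration; the `train_step` callable, the in-place model object, and the data iterables are hypothetical placeholders, not the paper's training code.

```python
def two_stage_finetune(model, sentence_data, document_data, train_step):
    """Progressive fine-tuning: sentence-level first, then document-level.

    model:         any object that train_step updates in place (hypothetical)
    sentence_data: iterable of sentence-level parallel batches
    document_data: iterable of filtered document-level parallel batches
    train_step:    callable(model, batch) applying one update (hypothetical)
    """
    # Stage 1: adapt on the abundant sentence-level MT resources.
    for batch in sentence_data:
        train_step(model, batch)
    # Stage 2: continue from the stage-1 weights on the filtered
    # document-level corpus to learn cross-sentence coherence.
    for batch in document_data:
        train_step(model, batch)
    return model

# Toy usage: record the order in which batches are consumed.
model = {"steps": []}
def record_step(m, batch):
    m["steps"].append(batch)

two_stage_finetune(model, ["sent-1", "sent-2"], ["doc-1"], record_step)
```

The key design point the sketch captures is ordering: document-level batches are only seen after the model has been stabilized on sentence-level data.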
Problem

Research questions and friction points this paper is trying to address.

document-level machine translation
large language models
parallel corpora scarcity
hallucination
omission
Innovation

Methods, ideas, or system contributions that make the work stand out.

document-level machine translation
large language models
synthetic corpus filtering
two-stage fine-tuning
hallucination mitigation
Ireh Kim
Department of Artificial Intelligence, Korea University, Seoul, South Korea
Tesia Sker
Department of Artificial Intelligence, Korea University, Seoul, South Korea
Chanwoo Kim
Professor of Artificial Intelligence at Korea University
Speech Recognition · Language Processing · Deep Learning · Signal Processing