🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in document-level machine translation, where scarce high-quality parallel corpora often lead to hallucinations and content omissions, resulting in performance inferior to traditional encoder-decoder architectures. To mitigate these issues, the authors propose a novel framework combining synthetic data filtering with a two-stage fine-tuning strategy. First, they transform summarization data into document-level parallel corpora and apply a multi-metric filtering approach based on sacreBLEU, COMET, and LaBSE cosine similarity. Subsequently, they perform progressive fine-tuning, initially at the sentence level and then at the document level, to enhance model reliability and contextual coherence. This approach effectively alleviates data scarcity and generation instability, significantly improving translation quality and discourse-level consistency.
📝 Abstract
In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment the data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we fine-tune in two stages: first on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
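The multi-metric filtering step described above can be sketched as a simple conjunctive gate: a synthetic document pair is kept only if it passes all three quality checks. The sketch below is illustrative, not the authors' implementation; the threshold values, the `keep_pair` helper, and the pure-Python cosine similarity are assumptions (in practice the BLEU and COMET scores would come from the sacreBLEU and COMET libraries, and the embeddings from a LaBSE encoder).

```python
import math

# Hypothetical cutoffs for illustration only; the paper does not state its exact thresholds.
BLEU_MIN = 20.0       # sacreBLEU score (0-100 scale)
COMET_MIN = 0.75      # COMET quality estimate
LABSE_COS_MIN = 0.80  # cosine similarity between LaBSE sentence embeddings

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def keep_pair(bleu_score, comet_score, src_embedding, tgt_embedding):
    """Keep a synthetic document pair only if every metric clears its threshold."""
    return (
        bleu_score >= BLEU_MIN
        and comet_score >= COMET_MIN
        and cosine_similarity(src_embedding, tgt_embedding) >= LABSE_COS_MIN
    )

# A pair with strong scores on all three metrics is retained;
# a low BLEU score alone is enough to discard it.
print(keep_pair(32.5, 0.82, [1.0, 0.0], [0.9, 0.1]))  # True
print(keep_pair(12.0, 0.82, [1.0, 0.0], [0.9, 0.1]))  # False
```

Gating on the conjunction of a surface-overlap metric (sacreBLEU), a learned quality metric (COMET), and a semantic-similarity metric (LaBSE cosine) means each metric can veto pairs the others miss, which is what makes the multi-metric design stricter than any single filter.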