🤖 AI Summary
This study investigates the contextual modeling capability of large language models (LLMs) for document-level machine translation, specifically addressing cross-sentence dependencies such as pronominal anaphora and lexical cohesion. We propose a Chain-of-Thought (CoT)-guided, context-aware translation framework and conduct a systematic evaluation of 12 state-of-the-art LLMs—including DeepSeek-R1, GPT-4/4o, Llama, Mistral, and Phi series—on the DiscEvalMT benchmark, measuring both discriminative accuracy and generative quality. Results show that CoT substantially enhances contextual understanding: the best-performing models reach about 90% accuracy on the discriminative task and COMET scores of about 92% on the generative task. Notably, we identify a "wise get wiser" effect: models with stronger baseline capabilities exhibit proportionally greater performance gains from CoT. This work advances LLM-based document-level translation by introducing a principled, interpretable methodology and providing empirical evidence for scalable, context-sensitive MT.
📝 Abstract
This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a "wise get wiser" effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
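The two prompt conditions compared in the paper (with and without chain-of-thought) can be illustrated with a minimal sketch of the discriminative task, in which the model must pick the correct translation of a sentence whose pronoun depends on the preceding context. The function and prompt wording below are hypothetical illustrations, not the paper's actual prompts:

```python
def build_prompt(context_en: str, source_en: str,
                 candidate_a: str, candidate_b: str,
                 use_cot: bool) -> str:
    """Build a discriminative-task prompt for an English-French sentence pair.

    The two candidates differ only in a context-dependent choice
    (e.g. pronoun gender); the surrounding context disambiguates them.
    """
    base = (
        f"Context: {context_en}\n"
        f"Sentence to translate: {source_en}\n"
        f"Candidate A: {candidate_a}\n"
        f"Candidate B: {candidate_b}\n"
    )
    if use_cot:
        # CoT variant: ask the model to reason about cross-sentence cues
        # (e.g. the antecedent of the pronoun) before committing to an answer.
        return base + (
            "First, identify which words in the context determine the "
            "correct translation (for example, the antecedent of a pronoun "
            "and its gender in French), then answer with 'A' or 'B'."
        )
    # Direct variant: no intermediate reasoning requested.
    return base + "Answer with 'A' or 'B' only."


# Example pair: "car" is "voiture" (feminine) in French, so the pronoun
# "It" must be rendered as "Elle", not "Il".
cot_prompt = build_prompt(
    context_en="I bought a new car last week.",
    source_en="It is very fast.",
    candidate_a="Elle est très rapide.",
    candidate_b="Il est très rapide.",
    use_cot=True,
)
```

Accuracy on this task is then simply the fraction of pairs for which the model selects the candidate consistent with the context.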