Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the contextual modeling capability of large language models (LLMs) for document-level machine translation, specifically cross-sentence dependencies such as pronominal anaphora and lexical cohesion. The authors propose a chain-of-thought (CoT)-guided, context-aware translation framework and systematically evaluate 12 state-of-the-art LLMs (including DeepSeek-R1, GPT-4/4o, Llama, Mistral, and Phi) on the English-French DiscEvalMT benchmark, measuring both discriminative accuracy and generative quality. Results show that CoT substantially enhances contextual understanding: the best models reach about 90% accuracy on the discriminative task and COMET scores of about 92% on the generative task, with GPT-4, GPT-4o, and Phi standing out. Notably, the paper identifies a "wise get wiser" effect: models with stronger baseline capabilities gain proportionally more from CoT prompting. This work advances LLM-based document-level translation with an interpretable methodology and empirical evidence for context-sensitive MT.

📝 Abstract
This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a "wise get wiser" effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to translate texts with inter-sentential dependencies
Assessing translation challenges in pronominal anaphora and lexical cohesion
Comparing chain-of-thought reasoning prompts for context-aware translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-thought reasoning enhances translation accuracy
Large language models handle inter-sentential dependencies
Reasoning prompts outperform direct translation prompts
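The prompt comparison described above (chain-of-thought vs. direct translation prompts) can be sketched as follows. The exact prompt wording used by the authors is not given in this summary, so the templates below are illustrative assumptions, not the paper's actual prompts:

```python
# Sketch of the two prompt styles compared in the paper: a direct
# translation prompt vs. a chain-of-thought (CoT) prompt that asks the
# model to reason about cross-sentence dependencies first.
# The wording is hypothetical; only the overall contrast is from the paper.

def build_prompt(context: str, source: str, use_cot: bool) -> str:
    """Build a context-aware English-to-French translation prompt."""
    base = (
        f"Context: {context}\n"
        f"Sentence to translate (English): {source}\n"
    )
    if use_cot:
        # CoT variant: elicit reasoning about anaphora / lexical cohesion
        # before producing the translation.
        return base + (
            "First, explain step by step which words in the sentence depend "
            "on the context (for example, what each pronoun refers to), "
            "then give the French translation."
        )
    # Direct variant: translate with no explicit reasoning step.
    return base + "Translate the sentence into French."

cot_prompt = build_prompt(
    context="The cat slept on the mat.",
    source="It looked very comfortable.",
    use_cot=True,
)
direct_prompt = build_prompt(
    context="The cat slept on the mat.",
    source="It looked very comfortable.",
    use_cot=False,
)
```

In the paper's discriminative task, the model would instead be shown two candidate French translations and asked to pick the correct one; the same CoT/direct contrast applies there.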