🤖 AI Summary
This work investigates whether generating intermediate "reasoning tokens" helps large reasoning models (LRMs) perform machine translation (MT). Across multiple language pairs of varying resource levels and multiple setups, the authors find that "thinking tokens" do not improve LRMs' translation quality. This also holds for models fine-tuned to reason before translating: fine-tuning on distilled chain-of-thought (CoT) explanations modeled on human translators' step-by-step practices does not outperform standard input-output fine-tuning. Improvements do appear, however, when the intermediate tokens are constructed by combining the outputs of modular, translation-specific prompting strategies; that is, when they contain concrete translation attempts rather than abstract reasoning. The key takeaway is that the contribution of intermediate tokens during fine-tuning depends on whether they include translation attempts, and that using a teacher model to refine target translations or to expand parallel corpora is more impactful than distilling its CoT explanations into "thinking" MT models.
📝 Abstract
Large reasoning models (LRMs) have opened new possibilities for problem-solving by generating a natural language thought process before answering a query. While their capabilities are well established on mathematics and coding tasks, their impact on machine translation (MT) remains underexplored. In this work, we explore the benefits of generating intermediate tokens when performing MT across multiple language pairs of different resource levels and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.
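The distinction the abstract draws can be made concrete with a minimal sketch of how a fine-tuning example might be assembled: instead of a free-form CoT explanation, the intermediate tokens are built from modular prompting outputs (here, a draft translation plus a refined one) so that they contain actual translation attempts. The function name, field names, and the `<think>` tag format below are illustrative assumptions, not the paper's exact scheme.

```python
def build_training_example(source: str,
                           draft: str,
                           refined: str,
                           final: str) -> dict:
    """Combine modular prompting outputs into intermediate tokens.

    The key property suggested by the abstract: the intermediate
    tokens contain concrete translation attempts (draft, refined),
    not just abstract reasoning about how to translate.
    """
    thinking = (
        f"Draft translation: {draft}\n"
        f"Refined translation: {refined}"
    )
    return {
        "input": source,
        "target": f"<think>\n{thinking}\n</think>\n{final}",
    }

# Hypothetical example pair (French -> English), for illustration only.
example = build_training_example(
    source="Le chat dort sur le canapé.",
    draft="The cat sleeps on the sofa.",
    refined="The cat is sleeping on the sofa.",
    final="The cat is sleeping on the sofa.",
)
```

In this framing, each "module" (drafting, refining, post-editing) could be produced by a separate teacher prompt, and the concatenation becomes the intermediate token sequence the student is trained to emit before the final translation.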