🤖 AI Summary
This work investigates whether test-time computation scaling can improve the machine translation (MT) quality of general-purpose large language models (LLMs). Covering three scenarios (direct translation, forced-reasoning extrapolation, and post-editing), we propose a reasoning-depth-adaptive mechanism and systematically evaluate 12 reasoning models across multi-domain MT benchmarks. Key findings are: (1) unadapted general-purpose LLMs yield only marginal gains from direct test-time scaling; (2) domain-adaptive fine-tuning substantially unlocks the effectiveness of test-time computation scaling, enabling consistent performance improvements; (3) integrating a self-correction pipeline into post-editing stabilizes and enhances translation accuracy as reasoning depth increases. This study is the first to empirically demonstrate the enabling roles of domain fine-tuning and self-correction in test-time computation scaling for MT, and it establishes a paradigm for efficient, controllable MT inference grounded in principled reasoning-depth adaptation.
📝 Abstract
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model's reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications such as multi-step self-correction workflows and in conjunction with task-specialized models.