AI Summary
Large language models (LLMs) perform poorly on low-resource language translation, yet the underlying failure mechanisms remain poorly understood. This work proposes Token Activation Rate (TAR) as a proxy metric for how effectively a model uses language-specific tokens, and builds a multilingual evaluation framework spanning 15 models and 22 language pairs across varying resource levels. Using COMET for automatic evaluation, the study reveals, for the first time, a strong association between TAR and translation quality. It finds that non-English-centric language pairs consistently score lower than English-centric pairs, and that reasoning-capable LLMs tend to generate more tokens when translating into low-TAR languages, apparently to compensate for inadequate representations, although the effect on translation quality varies across models. This research offers a new perspective and a quantitative tool for diagnosing bottlenecks in low-resource machine translation.
Abstract
Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.
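The abstract does not give a formula for TAR, so the following is only a minimal sketch of one plausible reading: the fraction of a language's vocabulary tokens that appear at least once in a model's generated output. The function name `token_activation_rate`, its parameters, and the toy token IDs are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set


def token_activation_rate(generated_token_ids: Iterable[int],
                          language_token_ids: Set[int]) -> float:
    """Share of language-specific vocabulary tokens seen in generated output.

    ASSUMPTION: this ratio is an illustrative reading of "Token Activation
    Rate"; the paper's exact definition may differ.
    """
    if not language_token_ids:
        raise ValueError("language_token_ids must not be empty")
    # Tokens that are both language-specific and actually generated.
    activated = set(generated_token_ids) & language_token_ids
    return len(activated) / len(language_token_ids)


# Toy example with made-up token IDs (not real data).
language_tokens = {10, 11, 12, 13}   # hypothetical tokens assigned to one language
generated = [10, 10, 12, 99]         # hypothetical model output; 99 is outside the set
tar = token_activation_rate(generated, language_tokens)
print(f"TAR = {tar:.2f}")
```

Under this reading, a language whose dedicated tokens are rarely produced gets a low score, which matches the abstract's description of lower TAR going together with poorer translation performance.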