🤖 AI Summary
Large reasoning models (LRMs) exhibit a pervasive “English-default” bias in multilingual reasoning, undermining interpretability and cultural-context understanding. This work presents the first systematic investigation of cross-lingual cognitive reasoning behavior, analyzing inference pathways across languages on the MGSM and GPQA Diamond benchmarks. We find that while English-based reasoning significantly improves overall accuracy, especially on complex tasks, it introduces a novel error class: *translation drift*, wherein critical semantic content from the source language is distorted during translation into English prior to reasoning. Our analysis reveals a fundamental tension between linguistic translation and reasoning capability, demonstrating that accuracy alone is insufficient for evaluating multilingual reasoning. We argue that semantic fidelity must be jointly optimized with correctness, and provide cognitive-level evidence supporting the development of truly multilingual-native reasoning models: models that reason natively in diverse languages without mandatory English mediation.
📝 Abstract
Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive behaviors in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap widening as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode: getting "Lost in Translation," where translation steps introduce errors that reasoning in the question's language would have avoided.