🤖 AI Summary
This study identifies an implicit language bias in large reasoning models (LRMs): when processing multilingual inputs, LRMs default to high-resource languages, particularly English, for their internal reasoning, regardless of the input language. To investigate this systematically, we design a multidimensional, controllable evaluation framework covering MMMLU, MATH-500, CulturalBench, and LMSYS-toxic, integrating reasoning-path tracing with cross-lingual attribution analysis. We empirically demonstrate that enforcing same-language reasoning reduces general reasoning accuracy (especially for low-resource languages) but significantly improves cultural alignment, while safety evaluations exhibit language-specific behavior. Crucially, this work is the first to empirically establish the mismatch between reasoning language and input language as a fundamental bottleneck to multilingual fairness. Our findings provide both empirical grounding and methodological tools for developing language-neutral LRMs.
📝 Abstract
Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: *In which language do these models reason when solving problems presented in different languages?* Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.