🤖 AI Summary
This study revisits the native reasoning gap in multilingual large language models: the reported drop in accuracy when the chain-of-thought must stay in the input language rather than pivot through English. The authors construct a long-reasoning dataset covering six languages and fine-tune native-reasoning and English-pivoted specialists on Qwen3-8B-Base. Under this comparable supervision, the average native reasoning gap shrinks to 1.9–3.5% across the five non-English languages, smaller than previously reported. Weight-space analysis of the native specialists shows that fine-tuning updates align in the middle layers and diverge in the outer layers, pointing to a largely language-agnostic reasoning core surrounded by language-specific layers. Building on this structure, the authors propose Layer Swap, which transfers the English specialist's middle layers into each native specialist and closes most of the native reasoning gap while keeping the chain-of-thought in the target language.
📝 Abstract
Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (native reasoning) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (English-pivoted reasoning). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili), fine-tune specialists in both native and English-pivoted regimes on top of Qwen/Qwen3-8B-Base, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9–3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce a Layer Swap: transferring the English specialist's stronger reasoning mid-layers into each native specialist, closing most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
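To make the two ideas above concrete, here is a minimal Python sketch of a per-layer comparison of fine-tuning updates and of a layer swap between two checkpoints. It is an illustration under stated assumptions, not the authors' implementation: parameters are toy lists of floats rather than tensors, the `model.layers.<i>.` naming scheme is assumed, and the function names and the `start`/`end` boundaries of the swapped range are hypothetical, since the abstract does not say which layers count as "middle".

```python
import math
import re

# Assumption: transformer-block parameters are named "model.layers.<i>.<rest>".
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def layer_index(name):
    """Return the block index encoded in a parameter name, or None."""
    match = LAYER_RE.match(name)
    return int(match.group(1)) if match else None


def update_cosine_by_layer(base, spec_a, spec_b):
    """Cosine similarity of two specialists' updates (specialist - base), per layer.

    Each argument maps parameter names to flat lists of floats.
    Parameters outside the numbered blocks (e.g. embeddings) are skipped.
    """
    dot, sq_a, sq_b = {}, {}, {}
    for name, base_vals in base.items():
        idx = layer_index(name)
        if idx is None:
            continue
        for b, xa, xb in zip(base_vals, spec_a[name], spec_b[name]):
            da, db = xa - b, xb - b
            dot[idx] = dot.get(idx, 0.0) + da * db
            sq_a[idx] = sq_a.get(idx, 0.0) + da * da
            sq_b[idx] = sq_b.get(idx, 0.0) + db * db
    return {
        i: dot[i] / math.sqrt(sq_a[i] * sq_b[i])
        for i in sorted(dot)
        if sq_a[i] > 0 and sq_b[i] > 0
    }


def layer_swap(native, english, start, end):
    """Return a shallow copy of `native` with blocks start..end-1 taken from `english`."""
    merged = dict(native)
    for name, value in english.items():
        idx = layer_index(name)
        if idx is not None and start <= idx < end:
            merged[name] = value
    return merged


# Toy 4-layer checkpoints (hypothetical values): updates agree in layers 1-2
# and are orthogonal in layers 0 and 3.
base = {"model.embed_tokens.weight": [0.0, 0.0]}
base.update({f"model.layers.{i}.mlp.weight": [0.0, 0.0] for i in range(4)})
english = {
    "model.embed_tokens.weight": [1.0, 0.0],
    "model.layers.0.mlp.weight": [1.0, 0.0],
    "model.layers.1.mlp.weight": [1.0, 1.0],
    "model.layers.2.mlp.weight": [1.0, 1.0],
    "model.layers.3.mlp.weight": [0.0, 1.0],
}
native = {
    "model.embed_tokens.weight": [0.0, 1.0],
    "model.layers.0.mlp.weight": [0.0, 1.0],
    "model.layers.1.mlp.weight": [2.0, 2.0],
    "model.layers.2.mlp.weight": [1.0, 1.0],
    "model.layers.3.mlp.weight": [1.0, 0.0],
}

sims = update_cosine_by_layer(base, english, native)
merged = layer_swap(native, english, start=1, end=3)
```

In this toy setup the middle blocks come from the English checkpoint, while the embeddings and outer blocks stay native, which mirrors the intent of keeping language-specific layers while borrowing the shared reasoning core. With real checkpoints the same logic would apply to a model's state dict of tensors.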