Rethinking the Multilingual Reasoning Gap with Layer Swap

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study revisits the performance drop that multilingual large language models show when their chain-of-thought is forced to stay in a non-English input language rather than pivoting through English. The authors construct long-reasoning datasets in six languages and fine-tune native-reasoning and English-pivoted specialists on top of Qwen3-8B-Base. Under this comparable supervision, the average native reasoning gap shrinks to 1.9–3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists shows fine-tuning updates that align in the middle layers and diverge in the outer layers, pointing to a largely language-agnostic reasoning core surrounded by language-specific layers. Building on this structure, the authors propose Layer Swap, which transfers the English specialist's middle layers into each native specialist and closes most of the native reasoning gap while preserving chain-of-thought in the target language.
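As a rough illustration of the weight-space analysis mentioned above, the sketch below compares the fine-tuning updates (specialist weights minus base weights) of two specialists layer by layer using cosine similarity. The metric, the `model.layers.<index>.` parameter naming, the function name, and the toy weights are all assumptions made for illustration; the summary does not state which measure the authors use.

```python
import math
import re
from collections import defaultdict
from typing import Dict, List

# Assumed naming scheme for transformer blocks: "model.layers.<index>.<rest>".
LAYER_PATTERN = re.compile(r"^model\.layers\.(\d+)\.")


def update_cosine_by_layer(
    base: Dict[str, List[float]],
    spec_a: Dict[str, List[float]],
    spec_b: Dict[str, List[float]],
) -> Dict[int, float]:
    """Cosine similarity between two specialists' updates, grouped by layer."""
    dot = defaultdict(float)
    norm_a = defaultdict(float)
    norm_b = defaultdict(float)
    for name, base_vals in base.items():
        match = LAYER_PATTERN.match(name)
        if match is None:
            continue  # skip embeddings, final norm, output head
        layer = int(match.group(1))
        for w0, wa, wb in zip(base_vals, spec_a[name], spec_b[name]):
            da, db = wa - w0, wb - w0  # fine-tuning updates
            dot[layer] += da * db
            norm_a[layer] += da * da
            norm_b[layer] += db * db
    return {
        layer: dot[layer] / math.sqrt(norm_a[layer] * norm_b[layer])
        for layer in sorted(dot)
        if norm_a[layer] > 0 and norm_b[layer] > 0
    }


# Toy weights (hypothetical): three layers, two parameters each, zero base.
base = {f"model.layers.{i}.mlp.weight": [0.0, 0.0] for i in range(3)}
french = {
    "model.layers.0.mlp.weight": [1.0, 0.0],
    "model.layers.1.mlp.weight": [1.0, 1.0],
    "model.layers.2.mlp.weight": [1.0, 0.0],
}
german = {
    "model.layers.0.mlp.weight": [0.0, 1.0],
    "model.layers.1.mlp.weight": [2.0, 2.0],
    "model.layers.2.mlp.weight": [-1.0, 0.0],
}

sims = update_cosine_by_layer(base, french, german)
for layer, sim in sims.items():
    print(f"layer {layer}: cosine of updates = {sim:+.2f}")
```

In this toy setup only the middle layer has aligned updates, which mirrors the pattern the summary describes: similarity near 1 in the middle and lower similarity in the outer layers.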
📝 Abstract
Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (native reasoning) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (English-pivoted reasoning). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili), fine-tune specialists in both native and English-pivoted regimes on top of Qwen/Qwen3-8B-Base, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9–3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce Layer Swap: transferring the English specialist's stronger reasoning mid-layers into each native specialist, which closes most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
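The Layer Swap idea described in the abstract can be sketched as a merge of two checkpoints' parameter dictionaries. This is a minimal sketch under stated assumptions, not the authors' implementation: the `model.layers.<index>.` naming, the function names, the toy values, and the chosen layer range are hypothetical, and the abstract does not say which layers count as the middle.

```python
import re
from typing import Any, Dict, Optional

# Assumed naming scheme for transformer blocks: "model.layers.<index>.<rest>".
LAYER_PATTERN = re.compile(r"^model\.layers\.(\d+)\.")


def layer_index(param_name: str) -> Optional[int]:
    """Return the block index encoded in a parameter name, or None."""
    match = LAYER_PATTERN.match(param_name)
    return int(match.group(1)) if match else None


def layer_swap(
    native_state: Dict[str, Any],
    english_state: Dict[str, Any],
    first_mid: int,
    last_mid: int,
) -> Dict[str, Any]:
    """Build a new parameter dict that keeps the native specialist's outer
    layers and takes blocks first_mid..last_mid (inclusive) from the
    English specialist. Neither input dict is modified."""
    if native_state.keys() != english_state.keys():
        raise ValueError("specialists must share the same architecture")
    merged = dict(native_state)
    for name in merged:
        idx = layer_index(name)
        if idx is not None and first_mid <= idx <= last_mid:
            merged[name] = english_state[name]
    return merged


# Toy checkpoints (hypothetical): six blocks plus an embedding table.
native = {f"model.layers.{i}.mlp.weight": f"native-{i}" for i in range(6)}
native["model.embed_tokens.weight"] = "native-embed"
english = {f"model.layers.{i}.mlp.weight": f"english-{i}" for i in range(6)}
english["model.embed_tokens.weight"] = "english-embed"

# The range 2..3 is an arbitrary illustrative choice of "middle" layers.
merged = layer_swap(native, english, first_mid=2, last_mid=3)
for name in sorted(merged):
    print(name, "->", merged[name])
```

Embeddings and the outer blocks stay native in this sketch, which matches the stated goal of keeping the chain-of-thought in the target language while borrowing the English specialist's mid-layers.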
Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning
reasoning gap
chain-of-thought
native reasoning
language-specific layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer Swap
native reasoning
multilingual reasoning
language-agnostic core
chain-of-thought