🤖 AI Summary
This study investigates how reinforcement learning (RL) and supervised fine-tuning (SFT) affect the cross-lingual reasoning generalization of large language models (LLMs), focusing on non-English mathematical, commonsense, and scientific reasoning tasks. Using Qwen2.5-3B-Base as the base model, we systematically compare the performance of RL and SFT on multilingual reasoning benchmarks and analyze the mechanistic role of non-English training data. Our key contributions are: (1) the first empirical demonstration that RL substantially outperforms SFT in cross-lingual reasoning, exhibiting superior generalization and robustness across languages; and (2) evidence that incorporating non-English data during RL training effectively mitigates English-centric bias, yielding consistent improvements in multilingual reasoning accuracy and cross-lingual strategy transfer. These findings provide both a novel methodology and an empirical foundation for developing truly multilingual, general-purpose reasoning models.
📝 Abstract
Enhancing the complex reasoning capabilities of Large Language Models (LLMs) has attracted widespread attention. While reinforcement learning (RL) has shown superior performance in improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into the cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks spanning math, commonsense, and scientific reasoning. Our investigation yields two significant findings: (1) tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization than SFT; and (2) RL training on non-English data yields better overall performance and generalization than training on English data, an effect not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the factors underlying RL's superiority and its generalization across languages. Our results provide compelling evidence that RL equips models with more robust reasoning strategies, offering crucial guidance toward more equitable and effective multilingual reasoning.