Long Chain-of-Thought Reasoning Across Languages

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates cross-lingual transfer of long chain-of-thought (CoT) reasoning in large language models (LLMs), focusing on French, Japanese, Latvian, and Swahili. To address the scarcity of high-quality non-English CoT data, the authors translate two popular English reasoning datasets and perform lightweight fine-tuning on Qwen 2.5 (7B) and Qwen 3 (8B). The findings are threefold: (1) the value of English as a pivot language varies by language: it provides no benefit for French, helps as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili, where both task comprehension and reasoning remain poor; (2) extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual gap, and a lightweight fine-tune on only 1k traces still improves Swahili performance by over 30%; and (3) the trade-off between data quality and scale is language-dependent: small, carefully curated datasets suffice for English and French, while larger but noisier corpora work better for Swahili and Latvian. Contributions include clarifying when and why long CoTs transfer across languages and releasing the translated datasets to enable equitable, reproducible multilingual reasoning research.
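
As a rough illustration of the data-construction step described above, the sketch below packages translated CoT traces into chat-formatted training examples for a Qwen model. The field names ("question", "cot", "answer"), the file path, and the prompt layout are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: turning translated CoT traces into chat-formatted
# training strings for a Qwen model. Field names and file path are
# illustrative assumptions, not details from the paper.
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

def to_training_text(example: dict) -> str:
    """Render one translated trace as a chat-formatted training string.

    The assistant turn carries the long chain of thought followed by the
    final answer, so the model is tuned to reason in the target language.
    """
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant",
         "content": f"{example['cot']}\n\nAnswer: {example['answer']}"},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)

with open("swahili_cot_traces.jsonl", encoding="utf-8") as f:
    texts = [to_training_text(json.loads(line)) for line in f]
```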

📝 Abstract
Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.
Problem

Research questions and friction points this paper is trying to address.

Investigating how long chain-of-thought reasoning transfers across languages
Evaluating the effectiveness of English as a pivot language for non-English reasoning
Analyzing data quality versus scale trade-offs in multilingual fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Translated versions of two popular English reasoning datasets for multilingual fine-tuning
Lightweight fine-tuning on only 1k reasoning traces, improving Swahili performance by over 30% (see the sketch after this list)
Characterization of language-dependent trade-offs between data quality and scale
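
To make the "lightweight fine-tuning" contribution concrete, here is a minimal LoRA sketch over roughly 1k chat-formatted traces using Hugging Face transformers and peft. The adapter configuration and hyperparameters are illustrative guesses, since the paper's exact recipe is not given in this summary.

```python
# A minimal LoRA fine-tuning sketch over ~1k chat-formatted traces,
# assuming the `texts` list produced in the earlier sketch. Adapter
# settings and hyperparameters are illustrative, not the paper's recipe.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16)

# Train only low-rank adapters on the attention projections; the base
# weights stay frozen, which is what keeps the fine-tune lightweight.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

texts = ["..."]  # ~1k chat-formatted CoT traces (see the earlier sketch)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-swahili-cot",
                           num_train_epochs=2,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           learning_rate=1e-4,
                           bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```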