Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

📅 2025-10-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multilingual reasoning models rely heavily on English, exhibiting weak logical reasoning in non-English languages (e.g., Korean) and lacking efficient approaches to bridge this gap. Method: The authors propose Language-Mixed Chain-of-Thought (Language-Mixed CoT), a reasoning schema that anchors reasoning in English while switching between English and Korean to minimize translation artifacts and strengthen logical deduction in non-English contexts. They curate Yi-Sang, a large-scale, high-quality Korean reasoning dataset, generating long CoT trajectories with Qwen3-32B. Using distillation and a data-curation pipeline, they train nine models (4B–35B) spanning six model families. Contribution/Results: The released KO-REAson-35B achieves state-of-the-art performance on five of nine multilingual benchmarks, with the highest overall average score (64.0 ± 25). Smaller and mid-sized variants show an average gain of +18.6 points across the nine benchmarks, demonstrating substantial improvements in cross-lingual reasoning capability.

📝 Abstract
Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B–35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 ± 25), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the nine evaluated benchmarks. Ablations show **Language-Mixed CoT** is more effective than monolingual CoT, also yielding cross-lingual and multi-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.
Problem

Research questions and friction points this paper is trying to address.

Addressing language-specific reasoning gaps in multilingual models
Developing Korean-focused reasoning via mixed-language chain-of-thought
Creating models that excel in native language reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Mixed CoT switches between English and the target language
Generated 3.7M reasoning traces using the Qwen3-32B model
Trained nine models across six model families, achieving state-of-the-art performance