Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

📅 2026-05-10

🤖 AI Summary
This work addresses the pronounced gap in mathematical reasoning performance that large language models show between high-resource and low-resource languages. To narrow this gap, the authors propose Crosslingual On-Policy Self-Distillation (COPSD), the first approach to adapt on-policy self-distillation to multilingual reasoning. In COPSD, the same model acts as both teacher and student. The teacher, given the English translation of the problem and an English reference solution, provides dense supervision to the student, which sees only the original low-resource-language input. Training minimizes a token-level KL divergence computed on the student's own rollouts. This avoids the sparse and unstable rewards of outcome-only reinforcement learning and improves answer-format adherence and test-time scaling. Evaluated on 17 low-resource African languages, COPSD substantially outperforms GRPO, with the largest gains in the lowest-resource languages, and it generalizes to harder multilingual reasoning benchmarks.
📝 Abstract
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
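The training signal described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the prompt templates, the function names (`build_prompts`, `per_token_kl`, `distillation_loss`), the choice of KL(student || teacher) as the divergence direction, and the random logits standing in for model outputs are all assumptions made for illustration. The sketch only shows the two ingredients the abstract names: a student prompt without privileged context versus a teacher prompt with it, and a divergence taken over the full vocabulary at every token of the student's rollout.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def per_token_kl(p_logits, q_logits):
    """KL(p || q) at each rollout position, summed over the full vocabulary.

    Both inputs have shape (num_tokens, vocab_size).
    """
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


def build_prompts(problem_lr, problem_en, solution_en):
    """Hypothetical prompt templates (assumption, not taken from the paper).

    The student sees only the low-resource problem; the teacher also sees
    the English translation and the English reference solution.
    """
    student_prompt = f"Problem: {problem_lr}\nSolution:"
    teacher_prompt = (
        f"Problem: {problem_lr}\n"
        f"English translation: {problem_en}\n"
        f"Reference solution: {solution_en}\n"
        "Solution:"
    )
    return student_prompt, teacher_prompt


def distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher); the direction is an assumption."""
    return float(per_token_kl(student_logits, teacher_logits).mean())


# Toy demo: random logits stand in for the same model scored under the two
# prompts on one student-generated rollout of 5 tokens with a vocabulary of 8.
rng = np.random.default_rng(0)
num_tokens, vocab_size = 5, 8
student_logits = rng.normal(size=(num_tokens, vocab_size))
teacher_logits = rng.normal(size=(num_tokens, vocab_size))

student_prompt, teacher_prompt = build_prompts(
    "<problem in low-resource language>",
    "<English translation>",
    "<English reference solution>",
)
loss = distillation_loss(student_logits, teacher_logits)
zero_loss = distillation_loss(student_logits, student_logits)
```

Every rollout token contributes a term to the loss, which is the sense in which the supervision is dense; an outcome-only reward would instead give one scalar per rollout. In a real training loop the teacher logits would be treated as fixed targets and gradients would flow only through the student pass, which is again an assumption about the setup rather than something stated above.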
Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning
low-resource languages
mathematical reasoning
crosslingual transfer
language disparity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crosslingual Self-Distillation
Multilingual Reasoning
On-Policy Learning
Low-Resource Languages
Token-Level Distillation