Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the pronounced gap in mathematical reasoning performance between high-resource and low-resource languages in large language models. To narrow this gap, the authors propose Crosslingual On-Policy Self-Distillation (COPSD), which adapts on-policy self-distillation to multilingual reasoning. In COPSD, a single model plays both roles: as teacher, it receives privileged crosslingual context (the English translation of the problem and an English reference solution); as student, it sees only the original low-resource-language problem. The teacher provides dense supervision on the student's own rollouts, and training minimizes a token-level divergence over the full output distribution. This avoids the sparsity and instability of outcome-only reinforcement learning and improves answer-format adherence and test-time scaling. Evaluated on 17 low-resource African languages, COPSD substantially outperforms Group Relative Policy Optimization (GRPO), with especially large gains for lower-resource languages, and generalizes to harder multilingual reasoning benchmarks.

📝 Abstract

Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
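
The training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt template, and the choice of KL(student || teacher) as the divergence are all assumptions made for illustration, since the abstract only specifies a "full-distribution token-level divergence" computed on the student's own rollouts. Plain lists of logits stand in for model outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities over the vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_divergence(student_logits, teacher_logits):
    """KL(student || teacher) at one position, summed over the full vocabulary.

    The KL direction is an assumption; the abstract does not name it.
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(student_steps, teacher_steps):
    """Mean per-token divergence along one student rollout.

    Both arguments hold one logit vector per token of the SAME rollout:
    the student scores it given only the low-resource problem, the teacher
    scores it given the privileged English context.
    """
    if len(student_steps) != len(teacher_steps) or not student_steps:
        raise ValueError("rollouts must be non-empty and of equal length")
    per_token = [token_divergence(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


def build_prompts(problem, english_translation, english_solution):
    """Hypothetical prompt construction; the template is illustrative only."""
    student_prompt = problem
    teacher_prompt = (
        f"{problem}\n\n"
        f"English translation: {english_translation}\n"
        f"Reference solution: {english_solution}"
    )
    return student_prompt, teacher_prompt


if __name__ == "__main__":
    # Toy 3-token vocabulary, 2-token rollout.
    student = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
    teacher = [[2.5, 0.2, 0.0], [0.1, 0.2, 2.0]]
    print(distillation_loss(student, teacher))
```

Because every token position contributes a term, the loss is dense, in contrast to a single outcome reward per rollout. In an actual training loop both sets of logits would presumably come from the same network evaluated on the two prompts, with the teacher's outputs treated as a fixed target; that detail is an assumption here as well.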

Problem

Research questions and friction points this paper is trying to address.

multilingual reasoning

low-resource languages

mathematical reasoning

crosslingual transfer

language disparity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Crosslingual Self-Distillation

Multilingual Reasoning

On-Policy Learning