Multilingual Test-Time Scaling via Initial Thought Transfer

📅 2025-05-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work systematically investigates test-time scaling (TTS) in multilingual settings, revealing substantially smaller performance gains for low-resource languages than for English, alongside pronounced language drift and intra-sample reasoning inconsistency. To address these issues, we propose MITT, a lightweight, unsupervised, cross-lingual prefix-tuning method that improves multilingual reasoning consistency via zero-shot initial thought transfer. This is the first systematic study of multilingual TTS and the first to explicitly model and mitigate inter-lingual initial-thought inconsistency and generation instability. Experiments on DeepSeek-R1-Distill-Qwen-7B demonstrate that MITT significantly improves reasoning accuracy for low-resource languages, narrowing the performance gap with English by up to 32%, while simultaneously enhancing generation stability and cross-lingual reasoning consistency.

πŸ“ Abstract
Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also exhibit lower internal consistency across generations in their early reasoning. Building on these findings, we introduce MITT (Multilingual Initial Thought Transfer), an unsupervised and lightweight reasoning prefix-tuning approach that transfers high-resource reasoning prefixes to enhance test-time scaling across all languages, addressing inconsistencies in multilingual reasoning performance. MITT significantly boosts DeepSeek-R1-Distill-Qwen-7B's reasoning performance, especially for underrepresented languages.
Problem

Research questions and friction points this paper is trying to address.

Study test-time scaling in multilingual settings
Address reasoning inconsistencies across languages
Enhance multilingual performance via thought transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual test-time scaling via MITT
Transfer high-resource reasoning prefixes
Enhance multilingual reasoning performance
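The page describes MITT only at a high level. As an illustration of the core idea, the sketch below shows what zero-shot initial thought transfer could look like at inference time: decoding for a target-language question is seeded with a high-resource (English) reasoning prefix so the model's initial thought is consistent across languages. The `generate` stub, the `<think>` prompt layout, and the function names are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of MITT-style initial thought transfer at inference.
# `generate` stands in for a call to a reasoning LLM such as
# DeepSeek-R1-Distill-Qwen-7B; the prompt format is an assumption.

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a real system would decode from here."""
    return prompt + " [model continuation]"


def solve_with_transfer(question: str, english_prefix: str) -> str:
    """Prepend an English reasoning prefix to the target-language question
    so generation starts from a consistent initial thought, then let the
    model continue reasoning from that prefix."""
    prompt = f"{question}\n<think>\n{english_prefix}"
    return generate(prompt)


# Example: a French question seeded with an English initial thought.
answer = solve_with_transfer(
    "Combien font 12 * 7 ?",
    "Let me break the problem into smaller steps.",
)
print(answer)
```

The design choice this sketch highlights is that the transfer is purely an inference-time intervention: no target-language supervision is needed, only a reusable high-resource prefix.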