Language as a Latent Variable for Reasoning Optimization

📅 2026-04-23

🤖 AI Summary
This study investigates how language acts as a latent variable that shapes the reasoning of large language models, observing that non-English responses can outperform English ones on some reasoning tasks. Building on this, the authors propose polyGRPO, a multilingual reinforcement learning framework that requires no chain-of-thought annotations. It treats language variation as an implicit exploration signal and generates polyglot preference data online under language-constrained and language-unconstrained conditions; the motivating evidence comes from the Polyglot Thinking Experiment. Fine-tuning Qwen2.5-7B-Instruct on only 18.1K multilingual math problems improves absolute accuracy by 6.72% on four English reasoning benchmarks, by 6.89% on multilingual benchmarks, and by 4.9% on an unseen English commonsense reasoning task, indicating cross-task generalization.

📝 Abstract
As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English ones on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and that the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and language-unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on their multilingual counterparts. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (by 4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.
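To make the idea of scoring a polyglot group concrete, here is a minimal sketch of the group-relative advantage step. It assumes the standard GRPO normalization (each rollout's reward minus the group mean, divided by the group standard deviation), applied to one group that mixes language-constrained and language-unconstrained responses to the same problem. The abstract does not give the reward definition, so the `Rollout` fields, the `reward` function, and the weights `w_acc` and `w_fmt` are hypothetical stand-ins for "answer accuracy" and "reasoning structure", not the authors' formulation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled response to a problem (hypothetical structure)."""
    language: str      # e.g. "en", "zh", or "free" for the unconstrained condition
    correct: bool      # did the final answer match the reference?
    well_formed: bool  # did the response follow the expected reasoning format?


def reward(r: Rollout, w_acc: float = 1.0, w_fmt: float = 0.2) -> float:
    # Assumed reward: weighted sum of answer accuracy and a structure check.
    # The weights are illustrative only.
    return w_acc * float(r.correct) + w_fmt * float(r.well_formed)


def group_relative_advantages(rollouts: list[Rollout], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    rewards = [reward(r) for r in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(x - mu) / (sigma + eps) for x in rewards]


# One group for a single math problem, mixing language conditions.
group = [
    Rollout("en", correct=False, well_formed=True),
    Rollout("zh", correct=True, well_formed=True),
    Rollout("free", correct=True, well_formed=True),
    Rollout("es", correct=False, well_formed=False),
]

for rollout, adv in zip(group, group_relative_advantages(group)):
    print(f"{rollout.language:>4}: advantage = {adv:+.3f}")
```

Because advantages are computed relative to the group, a correct response in any language is reinforced whenever other language conditions fail on the same problem, which is one plausible reading of how language variation can serve as an exploration signal.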
Problem

Research questions and friction points this paper is trying to address.

language as latent variable
reasoning optimization
multilingual reasoning
polyglot thinking
reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent variable
polyglot reasoning
reinforcement learning
language modulation
cross-task generalization