Language as a Latent Variable for Reasoning Optimization

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how language functions as a latent variable that shapes the reasoning of large language models, showing that non-English responses can outperform English ones on certain reasoning tasks. The observation comes from a Polyglot Thinking Experiment, in which models solve identical problems under language-constrained and language-unconstrained conditions. Building on it, the authors propose polyGRPO, a multilingual reinforcement learning framework that requires no chain-of-thought annotations. It treats language variation as an implicit exploration signal and generates polyglot preference data online under both conditions. Fine-tuning Qwen2.5-7B-Instruct on only 18.1K multilingual math problems yields absolute accuracy improvements of 6.72% on four English reasoning benchmarks, 6.89% on multilingual benchmarks, and 4.9% on unseen English commonsense reasoning, indicating cross-task generalization beyond the math training data.

📝 Abstract

As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on their multilingual counterparts. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.
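
The "group relative" part of the method name can be illustrated with a minimal sketch. It assumes the standard GRPO-style normalization, in which each sampled response's reward is standardized against the other responses to the same problem. The rollout records, the condition labels, and the binary correctness reward are hypothetical placeholders: the abstract does not specify polyGRPO's actual reward, which also accounts for reasoning structure.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same problem.

    Responses that beat the group average get a positive advantage,
    the rest a negative one. `eps` avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of rollouts for a single math problem, mixing
# language-constrained and language-unconstrained conditions.
rollouts = [
    {"condition": "constrained:en", "correct": False},
    {"condition": "constrained:zh", "correct": True},
    {"condition": "constrained:es", "correct": False},
    {"condition": "unconstrained", "correct": True},
]

# Assumed reward: 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0 if r["correct"] else 0.0 for r in rollouts]
advantages = group_relative_advantages(rewards)

for rollout, adv in zip(rollouts, advantages):
    print(f"{rollout['condition']:>16}: advantage = {adv:+.3f}")
```

Under these assumptions, a response in any language that reaches the correct answer is reinforced relative to its siblings, which is one way language variation could act as an exploration signal without chain-of-thought annotations.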

Problem

Research questions and friction points this paper is trying to address.

language as latent variable

reasoning optimization

multilingual reasoning

polyglot thinking

reasoning performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent variable

polyglot reasoning

reinforcement learning