🤖 AI Summary
Large language models (LLMs) reason markedly worse in non-English languages than in English. To narrow this gap, the paper proposes PB-RLSVR, a framework that uses a high-performing English LLM as a linguistic pivot and introduces a semantically verifiable cross-lingual reward for multilingual reasoning alignment, without requiring human annotations in the target languages. The reward combines embedding similarity with back-translation via machine translation to score the semantic equivalence between a non-English response and the English reference, and the multilingual model is then optimized against this reward with reinforcement learning. On multilingual reasoning benchmarks, PB-RLSVR improves the average performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, outperforming PPO-based baselines. Its core contribution is the introduction of semantically verifiable rewards for cross-lingual reasoning transfer, yielding strong generalization and interpretability with no dependence on target-language annotations.
📝 Abstract
While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
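The embedding-based variant of the reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rescaling of cosine similarity to [0, 1], and the hard-coded toy vectors are all assumptions made for illustration. In practice `embed` would be a real multilingual sentence encoder that maps text in different languages into a shared vector space.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_reward(
    response: str,
    pivot_reference: str,
    embed: Callable[[str], Vector],
) -> float:
    """Score a target-language response against the English pivot reference.

    Assumption: cosine similarity in [-1, 1] is rescaled linearly to [0, 1];
    the paper may scale or threshold the similarity differently.
    """
    sim = cosine_similarity(embed(response), embed(pivot_reference))
    return max(0.0, min(1.0, (sim + 1.0) / 2.0))


# Hypothetical stand-in for a multilingual encoder: hand-picked vectors in
# which the Spanish paraphrase lies close to the English reference.
TOY_EMBEDDINGS = {
    "The answer is 12.": (0.90, 0.10, 0.00),
    "La respuesta es 12.": (0.88, 0.12, 0.02),
    "No lo sé.": (0.00, 0.20, 0.90),
}


def toy_embed(text: str) -> Vector:
    return TOY_EMBEDDINGS[text]


reference = "The answer is 12."
good_reward = semantic_reward("La respuesta es 12.", reference, toy_embed)
bad_reward = semantic_reward("No lo sé.", reference, toy_embed)
```

Under these toy vectors the faithful Spanish answer receives a higher reward than the unrelated one. A scalar of this kind is what a reinforcement learning algorithm would maximize in place of a label-based reward.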