Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance gap that large language models (LLMs) exhibit in non-English reasoning relative to English, this paper proposes PB-RLSVR, a framework that uses a high-performing English LLM as a linguistic pivot and introduces a semantically verifiable cross-lingual reward mechanism for multilingual reasoning alignment, without requiring human annotations in target languages. The mechanism combines embedding similarity with machine-translation back-translation to construct a semantic-equivalence reward function, which is then optimized via reinforcement learning to refine the non-English model's responses. On multilingual reasoning benchmarks, PB-RLSVR improves Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17% on average, respectively, outperforming PPO-based baselines. Its core contribution is being the first to introduce semantically verifiable rewards for cross-lingual reasoning transfer, achieving strong generalization and interpretability while eliminating any dependence on target-language annotations.

📝 Abstract
While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
Problem

Research questions and friction points this paper is trying to address.

Addressing performance disparities in multilingual reasoning across languages
Enhancing multilingual reasoning without human-annotated target language data
Transferring English model reasoning capabilities to multilingual models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses English LLM as pivot model for multilingual reasoning
Rewards semantic equivalence to English reference responses
Employs cross-lingual embedding and translation reward functions
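The reward described above, combining embedding similarity with a back-translation check against the English pivot reference, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the character-bigram `embed` function is a runnable toy stand-in for a real multilingual sentence encoder, `back_translate` is a hypothetical MT callable, and the weighting `alpha` is an assumed hyperparameter.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a multilingual sentence encoder: a character
    # bigram count vector, used here only so the sketch runs end to end.
    return Counter(text.lower()[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_reward(response, english_reference, back_translate, alpha=0.5):
    """Sketch of a cross-lingual semantic-equivalence reward.

    Blends (1) similarity between the target-language response and the
    English pivot reference in a shared embedding space with (2) similarity
    after back-translating the response into English via `back_translate`
    (a hypothetical MT system). `alpha` weights the two signals.
    """
    embedding_score = cosine(embed(response), embed(english_reference))
    bt_score = cosine(embed(back_translate(response)),
                      embed(english_reference))
    return alpha * embedding_score + (1 - alpha) * bt_score
```

In an RL loop, this scalar would serve as the per-response reward signal used to update the multilingual policy model against the pivot's reference answers.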