🤖 AI Summary
This study investigates the cross-lingual transferability of English reward models (RMs) to non-English languages for improving instruction-following and alignment in multilingual RLHF. We propose a systematic evaluation framework integrating Multilingual RewardBench, representation shift analysis, multilingual instruction fine-tuning, and offline RM comparison. Our central empirical finding is that English RMs, applied directly without translation, outperform corresponding target-language RMs by 3–4% on average across multilingual reward evaluation—demonstrating strong generalization rooted in cross-lingual alignment of the representation space and transferable instruction-understanding capabilities. We further validate that this transfer significantly enhances multilingual instruction-following performance. To foster reproducibility and community advancement, we open-source all code, models, and datasets, establishing a new paradigm for multilingual RLHF.
📝 Abstract
Reinforcement learning with human feedback (RLHF) has been shown to benefit substantially from precise reward models (RMs). However, recent studies of reward modeling are skewed toward English, limiting the applicability of RLHF to multilingual alignment. In this work, we investigate the cross-lingual transfer of RMs, primarily from English to diverse target languages. Our experimental results demonstrate the strong cross-lingual transfer of English RMs, which exceed target-language RMs by a 3–4% average increase on Multilingual RewardBench. Furthermore, we analyze the cross-lingual transfer of RMs through representation shifts. Finally, we perform multilingual alignment to exemplify how cross-lingual transfer in RMs propagates to enhanced multilingual instruction-following capability, along with extensive analyses of off-the-shelf RMs. We release the code, model, and data.
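RewardBench-style benchmarks such as the Multilingual RewardBench referenced above typically evaluate an RM by pairwise accuracy: the fraction of (chosen, rejected) response pairs on which the RM assigns the chosen response a higher score. A minimal sketch of that metric, where `toy_rm` is a hypothetical stand-in for a real scalar-scoring reward model (not the paper's model):

```python
def pairwise_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where the RM ranks chosen higher.

    pairs: list of (chosen, rejected) response strings.
    score: callable mapping a response string to a scalar reward.
    """
    correct = sum(score(chosen) > score(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)

# Toy scoring function (hypothetical): pretend longer answers are better.
toy_rm = lambda text: len(text)

pairs = [
    ("A detailed, helpful answer.", "ok"),
    ("Step-by-step explanation with examples.", "no idea"),
    ("short", "a much longer but rejected reply"),
]
# toy_rm ranks 2 of the 3 pairs correctly, giving accuracy 2/3.
print(pairwise_accuracy(pairs, toy_rm))
```

The reported 3–4% gap between English and target-language RMs is a difference in exactly this kind of accuracy, averaged over languages.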