mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

πŸ“… 2025-10-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current large language models (LLMs) generalize poorly as reward models in non-English settings, and efficient multilingual training paradigms for such models are lacking. To address this, the paper proposes mR3, the first general-purpose, rubric-agnostic multilingual reward reasoning model covering 72 languages. mR3 combines multilingual reasoning data, curriculum-based data selection, and knowledge distillation to achieve strong performance with a lightweight architecture. On a comprehensive multilingual reward modeling benchmark, mR3 outperforms the 120B-parameter GPT-OSS baseline while using up to 9× fewer parameters, and ablation studies confirm the contribution of each component. Notably, this work establishes a rubric-agnostic alignment framework with unprecedentedly broad language coverage, enabling scalable evaluation of multilingual LLMs with low resource dependence.

πŸ“ Abstract
Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual LLM judge performance for evaluation
Developing effective multilingual training strategies for reward models
Creating compact yet powerful multilingual reward reasoning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual reward model trained on 72 languages
Integrates target-language reasoning datasets for training
Achieves state-of-the-art performance on multilingual reward model benchmarks while being up to 9× smaller than larger baselines
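To make the "rubric-agnostic" idea concrete, here is a minimal, hypothetical sketch (not the actual mR3 API; the function name and prompt wording are assumptions) of how a judge prompt can accept an arbitrary rubric at inference time, falling back to generic criteria when none is given:

```python
from typing import Optional


def build_judge_prompt(instruction: str, response: str,
                       rubric: Optional[str] = None,
                       language: str = "English") -> str:
    """Assemble an LLM-judge prompt; works with or without a rubric,
    which is the essence of rubric-agnostic evaluation."""
    # If no rubric is supplied, fall back to generic quality criteria.
    criteria = rubric or "Judge overall helpfulness, correctness, and fluency."
    return (
        f"You are an impartial judge evaluating a {language} response.\n"
        f"Evaluation criteria: {criteria}\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response:\n{response}\n\n"
        "First reason step by step, then output a score from 1 to 5."
    )
```

The key design point is that the rubric is a runtime input rather than a fixed training-time criterion, so the same model can be reused across evaluation tasks and languages.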
πŸ”Ž Similar Papers
No similar papers found.