MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Achieving native-level response quality from large language models (LLMs) in multilingual settings remains challenging. This paper introduces MENLO, a framework that addresses this gap. Methodologically, it first constructs MENLO-47, a systematically curated multilingual preference dataset covering 47 language varieties and comprising 6,423 high-agreement, audience-aware human preference annotations. It then proposes a structured, multidimensional evaluation framework paired with a generative reward model, trained via reinforcement learning, that jointly leverages pairwise preference ranking, reward shaping, and multi-task learning to enable quantifiable cross-lingual assessment and optimization of native-like quality. Experiments show significant improvements in LLM performance on multilingual quality-judgment tasks. The open-sourced dataset and framework provide infrastructure for future research on multilingual LLM alignment and evaluation.
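To make the evaluation setup concrete, here is a minimal sketch of rubric-guided pairwise judging. Everything in it is an assumption for illustration: the dimension labels, build_judge_prompt, and mock_judge are hypothetical stand-ins, not the paper's released code.

```python
# Minimal sketch of rubric-guided pairwise judging, assuming a judge that
# returns a JSON verdict. Dimension labels are illustrative stand-ins for
# MENLO's four quality dimensions, not the paper's actual rubric.
import json

DIMENSIONS = ["fluency", "tone", "localization", "helpfulness"]

def build_judge_prompt(prompt: str, resp_a: str, resp_b: str, lang: str) -> str:
    """Assemble a pairwise judging prompt with a structured rubric."""
    rubric = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        f"Judge two {lang} responses for native-like quality.\n"
        f"Score each dimension, then pick the better response (A or B).\n"
        f"Dimensions:\n{rubric}\n\n"
        f"Prompt: {prompt}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
        'Reply as JSON: {"winner": "A" or "B", "scores": {dimension: 1-5}}'
    )

def mock_judge(judge_prompt: str) -> str:
    """Stand-in for an LLM call; swap in your model client here."""
    return json.dumps({"winner": "A", "scores": {d: 4 for d in DIMENSIONS}})

verdict = json.loads(mock_judge(build_judge_prompt(
    "Explain tides to a child.", "Draft A ...", "Draft B ...", "es-MX")))
print(verdict["winner"], verdict["scores"])
```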

📝 Abstract
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
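As a rough illustration of how reward shaping could combine with preference labels during RL fine-tuning of a judge, the toy function below grants full credit for matching the human preference and partial credit for per-dimension agreement. The blend weight and field names are assumptions, not the paper's recipe.

```python
# Hedged sketch of a shaped reward for RL-training an LLM judge: full
# credit for matching the human preference label, partial credit for
# per-dimension agreement. Weighting and fields are assumptions.

def shaped_reward(pred_winner: str, gold_winner: str,
                  pred_dims: dict[str, str], gold_dims: dict[str, str],
                  dim_weight: float = 0.5) -> float:
    """Return a scalar reward in [0, 1] for one judged preference pair."""
    preference_reward = 1.0 if pred_winner == gold_winner else 0.0
    if not gold_dims:
        return preference_reward
    dim_agreement = sum(
        pred_dims.get(d) == g for d, g in gold_dims.items()
    ) / len(gold_dims)
    # Blend: preference match dominates; dimension agreement shapes it.
    return (1 - dim_weight) * preference_reward + dim_weight * dim_agreement

# Example: judge picks the right winner but matches 2 of 4 dimensions.
pred = {"fluency": "A", "tone": "B", "localization": "A", "helpfulness": "B"}
gold = {"fluency": "A", "tone": "A", "localization": "A", "helpfulness": "A"}
print(shaped_reward("A", "A", pred, gold))  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```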
Problem

Research questions and friction points this paper is trying to address.

Evaluating the native-like quality of LLM responses across 47 language varieties
Building a human-annotated preference dataset for multilingual quality assessment
Improving multilingual proficiency through reinforcement learning and reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

A framework that operationalizes native-like quality evaluation via audience design-inspired mechanisms
A dataset of 6,423 human-annotated preference pairs with high inter-annotator agreement
Fine-tuning judges with reinforcement learning, reward shaping, and multi-task learning (see the sketch below)
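As a sketch of the generative-reward-model use case from the abstract, the snippet below applies a judge-derived scalar score to best-of-n response selection. score_response is a hypothetical placeholder for parsing a score from the judge's output, not an API the paper defines.

```python
# Sketch of using an RL-trained judge as a generative reward model for
# best-of-n selection. score_response is a hypothetical stand-in; in
# practice it would parse a scalar score from the judge's output.

def score_response(prompt: str, response: str, lang: str) -> float:
    """Dummy scorer standing in for a judge call."""
    return (len(response) % 10) / 10.0  # placeholder signal for the demo

def best_of_n(prompt: str, candidates: list[str], lang: str) -> str:
    """Pick the candidate the judge scores highest."""
    return max(candidates, key=lambda c: score_response(prompt, c, lang))

drafts = ["short draft", "a somewhat longer draft", "the longest draft of all"]
print(best_of_n("Write a friendly greeting.", drafts, "ja-JP"))
```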
Authors

Chenxi Whitehouse
Research Scientist, Meta
Natural Language Processing

Sebastian Ruder
Research Scientist, Meta
Natural Language Processing, Machine Learning, Deep Learning, Artificial Intelligence

Tony Zhiyang Lin
Meta Superintelligence Labs

Oksana Kurylo
Meta Superintelligence Labs

Haruka Takagi
Meta Superintelligence Labs

Janice Lam
Meta Superintelligence Labs

Nicolò Busetto
Meta Superintelligence Labs

Denise Diaz
Meta Superintelligence Labs