🤖 AI Summary
Achieving native-like response quality from large language models (LLMs) in multilingual settings remains challenging. This paper introduces MENLO, a framework designed to address this gap. Methodologically, it first constructs MENLO-47, a systematically curated multilingual preference dataset covering 47 language varieties and comprising 6,423 high-agreement, audience-aware human annotations across four quality dimensions. It then proposes a structured, multi-dimensional evaluation framework paired with a generative reward model, trained via reinforcement learning, that combines pairwise preference ranking, reward shaping, and multi-task learning to enable quantifiable cross-lingual assessment and optimization of native-like quality. Experimental results show substantial improvements in LLM performance on multilingual quality-judgment tasks, although a gap to human annotators remains. The released dataset and evaluation framework provide infrastructure for future research in multilingual LLM alignment and evaluation.
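The summary above names the core training signals (pairwise preference ranking, reward shaping) without showing how they fit together. The sketch below is illustrative only, not the authors' released code: the `judge` callable stands in for a fine-tuned LLM judge, and the rubric wording, function names, and reward values (+1 / -1 / 0) are assumptions chosen to make the idea concrete.

```python
import re
from typing import Callable

# Illustrative rubric; the paper's actual annotation rubrics are more detailed.
RUBRIC = (
    "You are judging which response sounds more native-like for the target "
    "language variety and audience. Consider fluency, register, cultural fit, "
    "and formatting. Answer with exactly 'A' or 'B'."
)

def build_pairwise_prompt(prompt: str, resp_a: str, resp_b: str, language: str) -> str:
    """Assemble a pairwise judging prompt with a structured rubric."""
    return (
        f"{RUBRIC}\n\nTarget language variety: {language}\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\nVerdict:"
    )

def shaped_reward(judge_output: str, human_label: str) -> float:
    """Reward shaping (assumed values): +1 if the parsed verdict matches the
    human preference, -1 if it disagrees, 0 if no parseable verdict."""
    match = re.search(r"\b([AB])\b", judge_output.strip().upper())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == human_label.upper() else -1.0

def rollout_reward(judge: Callable[[str], str], prompt: str, resp_a: str,
                   resp_b: str, language: str, human_label: str) -> float:
    """One RL rollout: query the judge and convert its verdict into a scalar reward."""
    judge_prompt = build_pairwise_prompt(prompt, resp_a, resp_b, language)
    return shaped_reward(judge(judge_prompt), human_label)

if __name__ == "__main__":
    dummy_judge = lambda _: "Verdict: A"  # stub standing in for an LLM judge
    print(rollout_reward(dummy_judge, "Explícame la fotosíntesis.",
                         "Respuesta A…", "Respuesta B…", "es-MX", "A"))
```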
📝 Abstract
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
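The abstract notes that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency. One common way to use such a judge at inference time is best-of-n selection over candidate responses; the sketch below shows that pattern under the same assumptions as above (a hypothetical `judge` callable returning 'A' or 'B') and is not the paper's implementation.

```python
from typing import Callable, List

def prefers_candidate(judge: Callable[[str], str], prompt: str,
                      candidate: str, incumbent: str) -> bool:
    """Ask the judge which of two responses is more native-like; True means it
    picked 'A' (the new candidate). Prompt format is illustrative."""
    verdict = judge(
        "Answer with exactly 'A' or 'B': which response is more native-like?\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{candidate}\n\nResponse B:\n{incumbent}\n\nVerdict:"
    )
    return verdict.strip().upper().endswith("A")

def best_of_n(judge: Callable[[str], str], prompt: str, candidates: List[str]) -> str:
    """Pick a response through successive pairwise comparisons, so the judge
    never has to emit a calibrated scalar score."""
    best = candidates[0]
    for candidate in candidates[1:]:
        if prefers_candidate(judge, prompt, candidate, best):
            best = candidate
    return best

if __name__ == "__main__":
    # Stub judge that always answers 'B' (keeps the incumbent); a real system
    # would call the RL-trained LLM judge here.
    dummy_judge = lambda _: "B"
    print(best_of_n(dummy_judge, "Écris une excuse polie pour un retard.",
                    ["Désolé.", "Je suis vraiment désolé pour mon retard."]))
```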