🤖 AI Summary
Achieving native-like response quality from large language models (LLMs) in multilingual settings remains challenging. This paper introduces MENLO, a framework designed to address this gap. Methodologically, it first constructs MENLO-47, a systematically curated multilingual preference dataset covering 47 language varieties and comprising 6,423 high-agreement, audience-aware human annotations across four quality dimensions. It then proposes a structured, multi-dimensional evaluation framework paired with a generative reward model, trained via reinforcement learning, that combines pairwise preference ranking, reward shaping, and multi-task learning to enable quantifiable cross-lingual assessment and optimization of native-like quality. Experimental results show substantial improvements in LLM performance on multilingual quality-judgment tasks, although a gap to human annotators remains. The released dataset and evaluation framework provide infrastructure for future research in multilingual LLM alignment and evaluation.
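The summary above names the core training signals (pairwise preference ranking, reward shaping) without showing how they fit together. The sketch below is illustrative only, not the authors' released code: the `judge` callable stands in for a fine-tuned LLM judge, and the rubric wording, function names, and reward values (+1 / -1 / 0) are assumptions chosen to make the idea concrete.

```python
import re
from typing import Callable

# Illustrative rubric; the paper's actual annotation rubrics are more detailed.
RUBRIC = (
    "You are judging which response sounds more native-like for the target "
    "language variety and audience. Consider fluency, register, cultural fit, "
    "and formatting. Answer with exactly 'A' or 'B'."
)

def build_pairwise_prompt(prompt: str, resp_a: str, resp_b: str, language: str) -> str:
    """Assemble a pairwise judging prompt with a structured rubric."""
    return (
        f"{RUBRIC}\n\nTarget language variety: {language}\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\nVerdict:"
    )

def shaped_reward(judge_output: str, human_label: str) -> float:
    """Reward shaping (assumed values): +1 if the parsed verdict matches the
    human preference, -1 if it disagrees, 0 if no parseable verdict."""
    match = re.search(r"\b([AB])\b", judge_output.strip().upper())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == human_label.upper() else -1.0

def rollout_reward(judge: Callable[[str], str], prompt: str, resp_a: str,
                   resp_b: str, language: str, human_label: str) -> float:
    """One RL rollout: query the judge and convert its verdict into a scalar reward."""
    judge_prompt = build_pairwise_prompt(prompt, resp_a, resp_b, language)
    return shaped_reward(judge(judge_prompt), human_label)

if __name__ == "__main__":
    dummy_judge = lambda _: "Verdict: A"  # stub standing in for an LLM judge
    print(rollout_reward(dummy_judge, "Explícame la fotosíntesis.",
                         "Respuesta A…", "Respuesta B…", "es-MX", "A"))
```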
📝 Abstract
Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
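The abstract notes that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency. One common way to use such a judge at inference time is best-of-n selection over candidate responses; the sketch below shows that pattern under the same assumptions as above (a hypothetical `judge` callable returning 'A' or 'B') and is not the paper's implementation.

```python
from typing import Callable, List

def prefers_candidate(judge: Callable[[str], str], prompt: str,
                      candidate: str, incumbent: str) -> bool:
    """Ask the judge which of two responses is more native-like; True means it
    picked 'A' (the new candidate). Prompt format is illustrative."""
    verdict = judge(
        "Answer with exactly 'A' or 'B': which response is more native-like?\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{candidate}\n\nResponse B:\n{incumbent}\n\nVerdict:"
    )
    return verdict.strip().upper().endswith("A")

def best_of_n(judge: Callable[[str], str], prompt: str, candidates: List[str]) -> str:
    """Pick a response through successive pairwise comparisons, so the judge
    never has to emit a calibrated scalar score."""
    best = candidates[0]
    for candidate in candidates[1:]:
        if prefers_candidate(judge, prompt, candidate, best):
            best = candidate
    return best

if __name__ == "__main__":
    # Stub judge that always answers 'B' (keeps the incumbent); a real system
    # would call the RL-trained LLM judge here.
    dummy_judge = lambda _: "B"
    print(best_of_n(dummy_judge, "Écris une excuse polie pour un retard.",
                    ["Désolé.", "Je suis vraiment désolé pour mon retard."]))
```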