🤖 AI Summary
Existing IFEval benchmarks are English-only, limiting assessment of large language models' (LLMs) instruction-following capabilities in multilingual and cross-cultural contexts. To address this, we introduce M-IFEval—the first multilingual instruction-following evaluation benchmark, covering French, Japanese, and Spanish. Methodologically, we extend the deterministic, rule-driven IFEval framework to multilingual settings through language-adapted instruction design, cross-lingual consistency verification, and human-AI collaborative validation. Our approach combines multilingual template engineering with automated rule-based evaluation to ensure objectivity and reproducibility. Experiments across eight state-of-the-art LLMs reveal substantial performance disparities between languages, underscoring the necessity of multilingual evaluation. M-IFEval thus provides an open-source, subjectivity-free, and fully reproducible benchmark for internationalized LLM assessment, enabling rigorous, culture-aware evaluation of instruction following across languages.
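To make "deterministic, rule-based evaluation" concrete, here is a minimal sketch of how IFEval-style verification works. The instruction IDs, checker functions, and parameters below are illustrative assumptions, not the benchmark's actual API: the key idea is that each instruction compiles to a pure function over the response text, so scoring needs no human or LLM judge and is exactly reproducible.

```python
# Sketch of IFEval-style deterministic verification (hypothetical names).
# Each instruction ID maps to a pure check over the response string.
import re

def check_lowercase(response: str) -> bool:
    """Passes if the response contains no uppercase characters."""
    return response == response.lower()

def check_keyword(response: str, keyword: str) -> bool:
    """Passes if the required keyword appears in the response."""
    return keyword.lower() in response.lower()

def check_num_sentences(response: str, max_sentences: int) -> bool:
    """Passes if the response has at most `max_sentences` sentences."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) <= max_sentences

# Registry of instruction IDs -> (checker, kwargs); IDs are illustrative.
VERIFIERS = {
    "change_case:lowercase": (check_lowercase, {}),
    "keywords:include": (check_keyword, {"keyword": "benchmark"}),
    "length:max_sentences": (check_num_sentences, {"max_sentences": 3}),
}

def evaluate(response: str, instruction_ids: list[str]) -> dict[str, bool]:
    """Run every verifier attached to a prompt; report pass/fail per rule."""
    results = {}
    for iid in instruction_ids:
        fn, kwargs = VERIFIERS[iid]
        results[iid] = fn(response, **kwargs)
    return results

print(evaluate("this is a short benchmark answer.",
               ["change_case:lowercase", "keywords:include"]))
# -> {'change_case:lowercase': True, 'keywords:include': True}
```

Under this scheme, extending the benchmark to a new language amounts to adding language-adapted entries to the registry, which is what keeps the evaluation objective across languages.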
📝 Abstract
Instruction following is a core capability of modern large language models (LLMs), making the evaluation of this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to eight state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
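The abstract distinguishes general instructions (case, keywords, length) from language-specific ones. As a hedged illustration of the latter—the concrete rule below is our own invented example, not necessarily one of M-IFEval's instructions—a Japanese-specific verifier might constrain the writing system itself, a kind of check with no English counterpart:

```python
# Hypothetical language-specific verifier: "answer only in hiragana".
# Kanji or katakana characters cause a failure; everything else is ignored.
def check_hiragana_only(response: str) -> bool:
    for ch in response:
        code = ord(ch)
        is_katakana = 0x30A0 <= code <= 0x30FF
        is_kanji = 0x4E00 <= code <= 0x9FFF
        if is_katakana or is_kanji:
            return False
    return True

print(check_hiragana_only("これはてすとです。"))  # True: hiragana only
print(check_hiragana_only("これはテストです。"))  # False: contains katakana
```

Because the check is a deterministic Unicode-range test rather than a judgment call, it retains IFEval's objectivity while probing a capability that exists only in Japanese.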