M-IFEval: Multilingual Instruction-Following Evaluation

📅 2025-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing IFEval benchmarks are English-only, limiting assessment of large language models’ (LLMs) instruction-following capabilities in multilingual and cross-cultural contexts. To address this, we introduce M-IFEval—the first multilingual instruction-following evaluation benchmark covering French, Japanese, and Spanish. Methodologically, we extend the deterministic, rule-driven IFEval framework to multilingual settings by proposing language-adapted instruction design, cross-lingual consistency verification, and human-AI collaborative validation. Our approach integrates multilingual template engineering with automated rule-based evaluation to ensure objectivity and reproducibility. Experiments across eight state-of-the-art LLMs reveal substantial inter-lingual performance disparities, underscoring the necessity of multilingual evaluation. M-IFEval thus provides the first open-source, subjectivity-free, and fully reproducible benchmark for internationalized LLM assessment, enabling rigorous, culture-aware evaluation of instruction following across languages.

📝 Abstract
Instruction following is a core capability of modern large language models (LLMs), making the evaluation of this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
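The "objective criteria" the abstract refers to are deterministic, programmatically verifiable rules: each instruction (e.g., "write at least N words", "include this keyword", "use exactly N paragraphs") can be checked by a small function with no human or AI judge. A minimal sketch of what such IFEval-style checks look like (these function names and rules are illustrative assumptions, not the benchmark's actual implementation):

```python
import re

# Illustrative IFEval-style verifiable instruction checks.
# The real M-IFEval rule set is defined in the paper's open-source
# benchmark; these helpers are a hypothetical sketch of the idea.

def check_min_word_count(response: str, min_words: int) -> bool:
    """Pass if the response contains at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words

def check_keyword_present(response: str, keyword: str) -> bool:
    """Pass if a required keyword appears, case-insensitively."""
    return re.search(re.escape(keyword), response, re.IGNORECASE) is not None

def check_paragraph_count(response: str, n_paragraphs: int) -> bool:
    """Pass if the response has exactly `n_paragraphs` paragraphs,
    treating blank lines as paragraph separators."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return len(paragraphs) == n_paragraphs

# Example: a French-language response checked against three rules.
response = "Bonjour le monde.\n\nVoici un deuxième paragraphe."
print(check_min_word_count(response, 5))          # True
print(check_keyword_present(response, "bonjour")) # True
print(check_paragraph_count(response, 2))         # True
```

Because every rule is a pure function of the response text, the evaluation is reproducible and subjectivity-free; extending it to other languages mainly requires adapting the rules to language-specific conventions (e.g., tokenization or script-specific counting), which is what the benchmark's language-specific instructions address.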
Problem

Research questions and friction points this paper is trying to address.

Evaluates multilingual instruction-following in LLMs
Expands IFEval to French, Japanese, Spanish
Assesses LLM performance across diverse languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Instruction Following Evaluation
Expands to French, Japanese, Spanish
Assesses LLMs across diverse cultural contexts