Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models

📅 2025-07-15
🤖 AI Summary
Existing multilingual instruction-following benchmarks rely heavily on machine translation, which introduces cross-lingual evaluation bias and, in particular, underestimates model capabilities in low-resource languages. To address this, we propose Marco-Bench-MIF, the first multilingual instruction-following benchmark covering 30 languages with systematic, multi-level localization. Its construction paradigm integrates human localization with rigorous translation verification and explicitly handles language-specific features such as casing, proper nouns, and cultural references. Experiments show that machine-translated data underestimates accuracy by 7–22% relative to localized data; scaling model size yields 45–60% performance gains, yet script differences remain a significant challenge. Marco-Bench-MIF reveals pronounced performance disparities between high- and low-resource languages and aims to set a standard for fair, robust multilingual instruction-following evaluation.

📝 Abstract
Instruction-following capability has become a key ability to evaluate in Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine-translated into other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we find that: (1) there is a 25-35% accuracy gap between high- and low-resource languages, (2) model scale strongly affects performance, by 45-60%, yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% compared with localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Marco-Bench-MIF is available at https://github.com/AIDC-AI/Marco-Bench-MIF.
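The capitalization example in the abstract can be made concrete with a small sketch. It is not taken from the benchmark's code: the function names `has_cased_letters` and `check_all_uppercase` are hypothetical, and the "answer in all capital letters" constraint is assumed here as a typical IFEval-style verifiable instruction. The sketch shows why a literally translated casing constraint can never be satisfied in a script without letter case, which is the kind of prompt that localization would need to rewrite.

```python
from typing import Optional


def has_cased_letters(text: str) -> bool:
    """Return True if any character has distinct upper and lower forms."""
    return any(ch.lower() != ch.upper() for ch in text)


def check_all_uppercase(response: str) -> Optional[bool]:
    """Hypothetical checker for an 'answer in all capital letters' constraint.

    Returns None when the response has no cased characters, signalling that
    the constraint is not applicable and the prompt needs a localized rewrite.
    """
    if not has_cased_letters(response):
        return None
    return response == response.upper()


# A naive check borrowed from an English-only benchmark: str.isupper() is
# False whenever the string contains no cased characters at all.
naive_english = "HELLO WORLD".isupper()
naive_chinese = "你好世界".isupper()

# The language-aware check separates "violated" from "not applicable".
localized_english = check_all_uppercase("HELLO WORLD")
localized_chinese = check_all_uppercase("你好世界")
mixed = check_all_uppercase("Hello 世界")

print(naive_english, naive_chinese)
print(localized_english, localized_chinese, mixed)
```

With the naive check, every Chinese response fails the constraint regardless of quality, so accuracy on a machine-translated prompt is understated. Returning a separate "not applicable" value is one possible design; a localized benchmark could instead replace the constraint with one that makes sense in the target language.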
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual instruction-following in LLMs
Addressing limitations of monolingual or machine-translated datasets
Assessing localization impact on model performance accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Localized multilingual extension of IFEval
Hybrid pipeline combining translation with verification
Comprehensive evaluation of 20+ LLMs
Authors

Bo Zeng, University of Pittsburgh
Chenyang Lyu, Alibaba (Large Language Models, Natural Language Processing, Machine Learning)
Sinuo Liu, Alibaba International Digital Commerce
Mingyan Zeng, Alibaba International Digital Commerce
Minghao Wu, Alibaba International Digital Commerce
Xuanfan Ni, Alibaba International Digital Commerce
Tianqi Shi, Alibaba International Digital Commerce
Yu Zhao, Alibaba International Digital Commerce
Yefeng Liu, Alibaba International Digital Commerce
Chenyu Zhu, Alibaba International Digital Commerce
Ruizhe Li, University of Aberdeen
Jiahui Geng, Mohamed bin Zayed University of Artificial Intelligence (Artificial Intelligence, Natural Language Processing)
Qing Li, MBZUAI
Yu Tong, Alibaba International Digital Commerce
Longyue Wang, Alibaba International (Large Language Model, Machine Translation, Natural Language Processing, Language Agent)
Weihua Luo, Alibaba (natural language processing, machine learning, artificial intelligence)
Kaifu Zhang, Assistant Professor of Marketing, Carnegie Mellon University (Two-sided markets, Internet platforms, e-commerce)