🤖 AI Summary
Existing static multilingual benchmarks (e.g., Belebele, M-MMLU, M-GSM) poorly reflect large language models’ real-world cross-lingual functional capabilities and robustness. Method: We introduce two functional evaluation benchmarks—CL-GSM Symbolic (mathematical reasoning) and CL-IFEval (instruction following)—covering French, Spanish, Hindi, Arabic, and Yoruba, constructed via systematic translation and adaptation of functional benchmark templates from their English originals. Contribution/Results: Experiments reveal that static benchmarks vary widely in how well they track functional performance: gaps of 17–24% between M-GSM and CL-GSM Symbolic and 15–24% between Belebele and CL-IFEval, versus only 0.5–3% between M-MMLU and CL-IFEval, showing that some static benchmarks substantially overestimate functional proficiency while others track it closely. The work also finds that cross-lingual robustness differs by language—Arabic and English are the most consistently strong performers across evaluation iterations—and establishes a more realistic, task-aligned paradigm for multilingual capability evaluation.
📝 Abstract
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e., across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly, there is a 15-24% performance drop across languages between Belebele and CL-IFEval, but only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). We also find that model robustness across languages varies significantly, with certain languages (e.g., Arabic, English) performing the most consistently well across evaluation iterations.