Multi-lingual Functional Evaluation for Large Language Models

📅 2025-06-25
🤖 AI Summary
Existing static multilingual benchmarks (e.g., Belebele, M-MMLU, M-GSM) poorly reflect large language models' real-world cross-lingual functional capabilities and robustness. Method: We introduce two new functional evaluation benchmarks—CL-GSM Symbolic (mathematical reasoning) and CL-IFEval (instruction following)—covering English plus French, Spanish, Hindi, Arabic, and Yoruba, constructed by systematically translating and culturally adapting English benchmark templates. Contribution/Results: Experiments reveal substantial performance gaps: 17–24% between M-GSM and CL-GSM Symbolic (in English, French and Spanish) and 15–24% between Belebele and CL-IFEval, whereas M-MMLU and CL-IFEval differ by only 0.5–3%, showing that some static benchmarks track functional proficiency far more closely than others. The work also identifies differences in cross-lingual performance stability—Arabic and English are the most consistently well-performing across evaluation iterations—and establishes a more realistic, task-aligned paradigm for multilingual capability evaluation.

📝 Abstract
Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English into five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e., across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly, there is a 15–24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5–3% performance drop between M-MMLU and CL-IFEval). We also find that model robustness across languages varies significantly, with certain languages (e.g., Arabic, English) being the most consistently well-performing across evaluation iterations.
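
To make the benchmark-construction idea concrete, the following is a minimal sketch of how a GSM-Symbolic-style templated problem could be instantiated across languages and scored. It is not the paper's actual pipeline: the template text, variable ranges, name lists, and helper functions below are illustrative assumptions, and the real CL-GSM Symbolic templates also involve cultural adaptation beyond literal translation.

```python
import random

# Hypothetical parallel templates with shared symbolic slots ({name}, {a}, {b}).
# The paper's actual CL-GSM Symbolic templates are not reproduced here.
TEMPLATES = {
    "en": "{name} buys {a} apples and then buys {b} more. How many apples does {name} have?",
    "fr": "{name} achète {a} pommes, puis en achète {b} de plus. Combien de pommes {name} a-t-il ?",
    "es": "{name} compra {a} manzanas y luego compra {b} más. ¿Cuántas manzanas tiene {name}?",
}

NAMES = {"en": ["Ada"], "fr": ["Amélie"], "es": ["Lucía"]}

def instantiate(lang: str, seed: int) -> tuple[str, int]:
    """Fill one template with fresh values and return (question, gold answer)."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    name = rng.choice(NAMES[lang])
    question = TEMPLATES[lang].format(name=name, a=a, b=b)
    return question, a + b  # gold answer derived symbolically from the slot values

def score(model_answer: str, gold: int) -> bool:
    """Exact-match scoring on the last integer appearing in the model's answer."""
    tokens = [t for t in model_answer.replace(",", " ").split() if t.lstrip("-").isdigit()]
    return bool(tokens) and int(tokens[-1]) == gold

if __name__ == "__main__":
    # Same seed across languages yields parallel items with identical gold answers.
    for lang in TEMPLATES:
        q, gold = instantiate(lang, seed=0)
        print(lang, "|", q, "| gold =", gold)
```

Because every language shares the same symbolic slots and seed, per-language accuracy differences can be attributed to the model rather than to different underlying problems, which is the property the functional benchmarks rely on.
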
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-lingual LLM performance beyond static benchmarks
Creating functional benchmarks for practical multi-lingual model assessment
Analyzing model robustness variations across diverse languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created multi-lingual functional benchmarks: CL-GSM Symbolic and CL-IFEval
Translated functional benchmark templates from English into French, Spanish, Hindi, Arabic and Yoruba
Evaluated model robustness and performance stability across languages and evaluation iterations