🤖 AI Summary
Existing multilingual LLM evaluation benchmarks lack cross-lingual alignment, resulting in fragmented assessments of both language coverage and skill acquisition. To address this, we introduce MuBench—a standardized benchmark spanning 61 languages—and propose a cross-lingually aligned evaluation framework with the Multilingual Consistency (MLC) metric, which overcomes the limitations of conventional accuracy in diagnosing performance bottlenecks. Through controlled ablation studies and large-scale pretraining analyses, we systematically characterize how language proportion and parallel data volume affect cross-lingual transfer. Experiments reveal substantial performance gaps for low-resource languages and, using a suite of 1.2B-parameter models, identify key drivers of cross-lingual generalization. MuBench is the first multilingual LLM benchmark offering comprehensive breadth (61 languages), depth (fine-grained skill evaluation), and interpretability (via MLC and controlled analysis), thereby enabling rigorous, comparable, and actionable assessment of multilingual capabilities.
📝 Abstract
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.
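The abstract does not spell out how Multilingual Consistency is computed, but since MuBench's items are aligned across languages, one natural reading is mean pairwise agreement of a model's predictions on the same items in different languages. The sketch below illustrates that interpretation; the function name, the agreement-based formula, and the toy predictions are assumptions for illustration, not the paper's actual definition.

```python
# Hedged sketch of an MLC-style score, ASSUMING it is defined as the mean
# pairwise answer agreement over cross-lingually aligned test items.
from itertools import combinations


def multilingual_consistency(answers_by_lang):
    """Mean pairwise agreement across all language pairs.

    answers_by_lang maps a language code to the model's predicted answers
    for the same aligned items (same order in every language).
    """
    langs = list(answers_by_lang)
    n_items = len(answers_by_lang[langs[0]])
    pair_scores = []
    for lang_a, lang_b in combinations(langs, 2):
        # Fraction of aligned items where the two languages got the same answer.
        agree = sum(
            a == b
            for a, b in zip(answers_by_lang[lang_a], answers_by_lang[lang_b])
        )
        pair_scores.append(agree / n_items)
    return sum(pair_scores) / len(pair_scores)


# Toy example: hypothetical multiple-choice predictions on 4 aligned items.
preds = {
    "en": ["A", "B", "C", "D"],
    "zh": ["A", "B", "D", "D"],
    "sw": ["A", "C", "C", "D"],
}
print(round(multilingual_consistency(preds), 3))  # → 0.667
```

Unlike per-language accuracy, a score like this needs no gold labels, so it can separate "the model knows the answer only in English" from "the model is uniformly wrong," which is the kind of bottleneck diagnosis the abstract attributes to MLC.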