🤖 AI Summary
Existing multilingual LLM evaluation benchmarks lack cross-lingual alignment, resulting in fragmented assessments of both language coverage and skill acquisition. To address this, we introduce MuBench—a standardized benchmark spanning 61 languages—and propose a cross-lingually aligned evaluation framework with the Multilingual Consistency (MLC) metric, which overcomes the limitations of conventional accuracy in diagnosing performance bottlenecks. Through controlled ablation studies and large-scale pretraining analyses, we systematically characterize how language proportion and parallel data volume affect cross-lingual transfer. Experiments reveal substantial performance gaps for low-resource languages and, using a suite of 1.2B-parameter models, identify key drivers of cross-lingual generalization. MuBench is the first multilingual LLM benchmark offering comprehensive breadth (61 languages), depth (fine-grained skill evaluation), and interpretability (via MLC and controlled analysis), thereby enabling rigorous, comparable, and actionable assessment of multilingual capabilities.
📝 Abstract
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.
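The abstract does not spell out how Multilingual Consistency is computed, but since MuBench's items are aligned across languages, one natural reading is mean pairwise agreement of a model's predictions on the same items in different languages. The sketch below illustrates that interpretation; the function name, the agreement-based formula, and the toy predictions are assumptions for illustration, not the paper's actual definition.

```python
# Hedged sketch of an MLC-style score, ASSUMING it is defined as the mean
# pairwise answer agreement over cross-lingually aligned test items.
from itertools import combinations


def multilingual_consistency(answers_by_lang):
    """Mean pairwise agreement across all language pairs.

    answers_by_lang maps a language code to the model's predicted answers
    for the same aligned items (same order in every language).
    """
    langs = list(answers_by_lang)
    n_items = len(answers_by_lang[langs[0]])
    pair_scores = []
    for lang_a, lang_b in combinations(langs, 2):
        # Fraction of aligned items where the two languages got the same answer.
        agree = sum(
            a == b
            for a, b in zip(answers_by_lang[lang_a], answers_by_lang[lang_b])
        )
        pair_scores.append(agree / n_items)
    return sum(pair_scores) / len(pair_scores)


# Toy example: hypothetical multiple-choice predictions on 4 aligned items.
preds = {
    "en": ["A", "B", "C", "D"],
    "zh": ["A", "B", "D", "D"],
    "sw": ["A", "C", "C", "D"],
}
print(round(multilingual_consistency(preds), 3))  # → 0.667
```

Unlike per-language accuracy, a score like this needs no gold labels, so it can separate "the model knows the answer only in English" from "the model is uniformly wrong," which is the kind of bottleneck diagnosis the abstract attributes to MLC.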