🤖 AI Summary
This work addresses the unclear origins of cross-language performance disparities in multilingual large language models, for which interpretable diagnostics have been lacking. The authors propose a two-step Bayesian hierarchical framework that decomposes the variance in multilingual performance and quantifies the contributions of language identity, model identity, and evaluation benchmark on understanding and reasoning tasks. Distribution-free hypothesis tests (Friedman and Kruskal-Wallis) first confirm that the gaps are systematic rather than sampling noise. The framework then shows that observable linguistic features (script, family, typological distance) explain 79% of the variance attributable to language identity on understanding tasks and 92% on reasoning tasks, with a model's internal representational similarity to English as the dominant predictor. Further analysis shows that understanding performance is driven primarily by model identity (66.7% of variance), whereas reasoning performance is dominated by the benchmark×model interaction (46.3%), giving practitioners concrete levers for optimizing multilingual models.
📝 Abstract
Multilingual Large Language Model (mLLM) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We begin by establishing that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.
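To make the two ideas in the abstract more concrete, the sketch below computes a Friedman statistic and a simple variance split on a toy model-by-language table. It is only an illustration under stated assumptions: the scores, model names, and function names are hypothetical, and the sum-of-squares split is a frequentist stand-in for intuition, not the Bayesian hierarchical model described in the paper.

```python
from statistics import mean

# Hypothetical toy accuracies, scores[model][language]; values are invented
# for illustration and are not taken from the paper.
scores = {
    "model_a": {"en": 0.90, "de": 0.85, "sw": 0.60},
    "model_b": {"en": 0.80, "de": 0.76, "sw": 0.52},
    "model_c": {"en": 0.70, "de": 0.64, "sw": 0.41},
}


def friedman_statistic(table):
    """Friedman chi-square statistic with models as blocks and languages as
    treatments. Assumes no tied scores within a model (no tie correction)."""
    languages = list(next(iter(table.values())))
    n, k = len(table), len(languages)
    rank_sums = dict.fromkeys(languages, 0)
    for row in table.values():
        # Rank languages within one model, lowest score gets rank 1.
        ordered = sorted(languages, key=lambda lang: row[lang])
        for rank, lang in enumerate(ordered, start=1):
            rank_sums[lang] += rank
    sum_sq = sum(r * r for r in rank_sums.values())
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


def variance_shares(table):
    """Split the total sum of squares of a balanced model-by-language table
    into model, language, and residual shares (two-way layout, one
    observation per cell)."""
    models = list(table)
    languages = list(table[models[0]])
    grand = mean(table[m][l] for m in models for l in languages)
    model_mean = {m: mean(table[m].values()) for m in models}
    lang_mean = {l: mean(table[m][l] for m in models) for l in languages}

    ss_total = sum((table[m][l] - grand) ** 2 for m in models for l in languages)
    ss_model = len(languages) * sum((model_mean[m] - grand) ** 2 for m in models)
    ss_lang = len(models) * sum((lang_mean[l] - grand) ** 2 for l in languages)
    ss_resid = ss_total - ss_model - ss_lang
    return {
        "model": ss_model / ss_total,
        "language": ss_lang / ss_total,
        "residual": ss_resid / ss_total,
    }


friedman_q = friedman_statistic(scores)
shares = variance_shares(scores)
print(f"Friedman statistic: {friedman_q:.3f}")
for component, share in shares.items():
    print(f"{component:>8}: {share:.1%} of total sum of squares")
```

In practice, `scipy.stats.friedmanchisquare` and `scipy.stats.kruskal` return p-values alongside the test statistics and handle ties, so they are a more natural choice than the hand-rolled statistic above. A real analysis would also include the benchmark dimension and its interactions, which this two-way toy table omits.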