DEPART: DEcomposing PARiTy across Multilingual LLMs

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the unclear origins of performance disparities across languages in multilingual large language models, which leaderboards report but rarely explain. The authors first use distribution-free hypothesis tests to show that the gaps are systematic rather than sampling noise, then propose a two-stage Bayesian hierarchical framework that decomposes the variance in multilingual performance and quantifies the contributions of language identity, model identity, and evaluation benchmark on understanding and reasoning tasks. In the first stage, observable linguistic features (script, family, typological distance) explain 79% of the language-attributable variance on understanding tasks and 92% on reasoning tasks, with a model's internal representational similarity to English as the dominant predictor in both. In the second stage, understanding performance is driven primarily by model identity (66.7% of variance), whereas reasoning performance is dominated by the benchmark-by-model interaction (46.3%), giving practitioners concrete levers for optimizing multilingual models.
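As a reading aid, the two reported quantities can be written out under a generic crossed random-effects model. This is a sketch under assumed notation, not the paper's stated specification: the symbols $\mu$, $\alpha_m$, $\beta_b$, $\gamma_\ell$, the interaction terms, and the residual-variance form of $R^2_{\text{ling}}$ are all assumptions introduced here for illustration.

$$
\begin{aligned}
y_{m,b,\ell} &= \mu + \alpha_m + \beta_b + \gamma_\ell + (\alpha\beta)_{mb} + (\alpha\gamma)_{m\ell} + (\beta\gamma)_{b\ell} + \varepsilon_{m,b,\ell}, \\
\text{share}_k &= \frac{\sigma^2_k}{\sum_j \sigma^2_j}, \qquad
R^2_{\text{ling}} = 1 - \frac{\sigma^2_{\gamma \mid x}}{\sigma^2_{\gamma}}.
\end{aligned}
$$

Here $y_{m,b,\ell}$ is the score of model $m$ on benchmark $b$ in language $\ell$, each $\sigma^2_k$ is the variance of one component, $\sigma^2_{\gamma}$ is the variance of the language effect, and $\sigma^2_{\gamma \mid x}$ is what remains of it after conditioning on language features $x$. Under this reading, "66.7%" would be $\text{share}_{\alpha}$ on understanding tasks and "46.3%" would be $\text{share}_{\alpha\beta}$ on reasoning tasks.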

📝 Abstract

Multilingual Large Language Model (mLLM) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that understanding (NLU) and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.
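The first step of the abstract, checking that language gaps are systematic rather than noise, can be illustrated with SciPy's Friedman and Kruskal-Wallis tests. The sketch below is not the authors' code: the data are synthetic, and the language list, per-language means, and number of (model, benchmark) blocks are hypothetical values chosen only to make the example run.

```python
import numpy as np
from scipy.stats import friedmanchisquare, kruskal

rng = np.random.default_rng(0)

# Hypothetical setup: 12 (model, benchmark) blocks scored in 4 languages.
languages = ["en", "de", "sw", "yo"]
language_means = np.array([0.80, 0.75, 0.60, 0.50])  # assumed, not from the paper
n_blocks = 12

# Each block has its own difficulty offset plus independent noise.
block_effect = rng.normal(0.0, 0.03, size=(n_blocks, 1))
noise = rng.normal(0.0, 0.02, size=(n_blocks, len(languages)))
scores = language_means + block_effect + noise  # shape (n_blocks, n_languages)

# Friedman test: languages are repeated measurements within each block,
# so block-level difficulty is controlled for by within-block ranking.
friedman_stat, friedman_p = friedmanchisquare(*scores.T)

# Kruskal-Wallis test: languages are treated as independent groups.
kw_stat, kw_p = kruskal(*scores.T)

print(f"Friedman:       stat={friedman_stat:.2f}, p={friedman_p:.2e}")
print(f"Kruskal-Wallis: stat={kw_stat:.2f}, p={kw_p:.2e}")
```

Both tests are rank-based, so neither assumes normally distributed accuracies. The Friedman test suits the paired layout here, since every block is scored in every language, while Kruskal-Wallis ignores the pairing and serves as a looser cross-check.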

Problem

Research questions and friction points this paper is trying to address.

multilingual LLMs

performance disparity

language bias

systematic variance

multilingual evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual LLMs

Bayesian hierarchical modeling

performance variance decomposition