🤖 AI Summary
Prior research on memorization in multilingual large language models (MLLMs) remains limited, and the prevailing assumption—that memorization strength is determined solely by training data volume—overlooks the critical role of cross-lingual similarity, thereby obscuring true memorization patterns.
Method: The authors conduct the first systematic, large-scale quantification of memorization behavior across 95 languages and propose a novel graph-structured language similarity metric that integrates genealogical and typological features (see the sketch after this summary).
Contribution/Results: Empirical analysis reveals that training data volume alone does not explain memorization patterns; instead, low-resource languages with high linguistic similarity to high-resource ones exhibit stronger cross-lingual memorization. Language similarity serves not only as a key explanatory variable for memorization but also as a foundational determinant of cross-lingual transfer capability. These findings provide a methodological foundation for privacy assessment and more robust training of multilingual models.
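To make the method above concrete, the sketch below shows one plausible way to build such a similarity graph: edge weights that blend a genealogical signal (shared language family) with a typological one (cosine similarity of feature vectors). This is only an illustration under assumptions, not the authors' implementation; the language set, the `lang_family` labels, the `typology_vec` features, and the blending weight `alpha` are all hypothetical.

```python
# Illustrative sketch only: one way to build a graph whose edge weights blend
# genealogical (language-family) and typological (feature-vector) similarity.
# Languages, family labels, feature vectors, and alpha are hypothetical.
from itertools import combinations

import numpy as np

lang_family = {"es": "Romance", "pt": "Romance", "gl": "Romance", "fi": "Uralic"}
typology_vec = {                      # e.g., binary WALS-style features
    "es": np.array([1, 0, 1, 1, 0]),
    "pt": np.array([1, 0, 1, 0, 0]),
    "gl": np.array([1, 0, 1, 1, 1]),
    "fi": np.array([0, 1, 0, 1, 1]),
}

def similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Blend genealogical and typological similarity into one edge weight."""
    genealogical = 1.0 if lang_family[a] == lang_family[b] else 0.0
    va, vb = typology_vec[a], typology_vec[b]
    typological = float(va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return alpha * genealogical + (1.0 - alpha) * typological

# Weighted edge list of the language-similarity graph.
edges = {(a, b): round(similarity(a, b), 3) for a, b in combinations(lang_family, 2)}
print(edges)  # e.g., ('es', 'pt') gets a high weight, ('es', 'fi') a low one
```

The convex combination is just one way to merge the two signals; the paper's actual metric may weight or structure the graph differently.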
📝 Abstract
We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored despite the inherently long-tailed nature of multilingual training corpora. We find that the prevailing assumption that memorization is highly correlated with training data availability fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation, ignoring their similarities, obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. They also constitute empirical evidence that language similarity both explains memorization in MLLMs and underpins cross-lingual transferability, with broad implications for multilingual NLP.
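As a rough illustration of how such a graph-based view can surface the reported trend, the sketch below restricts the correlation between training-token counts and memorization rates to each language's graph neighborhood. It is a sketch under assumptions, not the paper's method: the per-language statistics, similarity scores, threshold, and the `neighborhood` helper are all made up for the example.

```python
# Illustrative sketch only: correlating training-data volume with memorization
# inside each language's similarity neighborhood. All statistics, similarity
# scores, and the threshold are hypothetical.
import numpy as np

train_tokens = {"es": 5e9, "pt": 1e9, "gl": 2e7, "fi": 8e8, "et": 5e7}      # tokens seen in training
memorization = {"es": 0.12, "pt": 0.15, "gl": 0.21, "fi": 0.10, "et": 0.14}  # memorized fraction
sim = {  # edge weights from a similarity graph like the one above; missing pairs count as 0
    ("es", "pt"): 0.91, ("es", "gl"): 0.86, ("pt", "gl"): 0.89,
    ("fi", "et"): 0.80, ("es", "fi"): 0.12, ("pt", "et"): 0.10,
}

def neighborhood(lang: str, threshold: float = 0.5) -> list[str]:
    """Languages whose graph similarity to `lang` is at least `threshold`."""
    group = {lang}
    for (a, b), w in sim.items():
        if w >= threshold and lang in (a, b):
            group.add(b if a == lang else a)
    return sorted(group)

# Within each neighborhood, correlate (log) token counts with memorization rates.
for lang in train_tokens:
    group = neighborhood(lang)
    if len(group) < 3:          # need at least 3 points for a meaningful correlation
        continue
    x = np.log([train_tokens[g] for g in group])
    y = [memorization[g] for g in group]
    rho = np.corrcoef(x, y)[0, 1]
    print(f"{lang}: neighborhood={group}, corr={rho:.2f}")  # negative here: fewer tokens, more memorization
```

The point of the illustration is that the negative relationship only becomes visible once languages are grouped by similarity; pooling all languages into a single correlation would wash it out.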