Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Background: Prior research on memorization in multilingual large language models (MLLMs) remains limited, and the prevailing assumption, that memorization strength is determined solely by training data volume, overlooks the critical role of cross-lingual similarity, thereby obscuring true memorization patterns.
Method: The authors conduct the first systematic, large-scale quantification of memorization behavior across 95 languages and propose a novel graph-structured language similarity metric integrating both genealogical and typological features.
Contribution/Results: Empirical analysis reveals that training data volume is not the dominant factor; instead, low-resource languages with high linguistic similarity to high-resource ones exhibit stronger cross-lingual memorization. Language similarity serves not only as a key explanatory variable for memorization but also as a foundational determinant of cross-lingual transfer capability. These findings provide a theoretical framework and methodological foundation for privacy assessment and robust training of multilingual models.

📝 Abstract
We present the first comprehensive study of memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation, ignoring their similarities, obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains memorization in MLLMs and underpins cross-lingual transferability, with broad implications for multilingual NLP.
Problem

Research questions and friction points this paper is trying to address.

Analyzing memorization patterns in multilingual large language models across 95 languages
Investigating the impact of language similarity on cross-lingual memorization behavior
Proposing a graph-based metric to model memorization considering language relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based metric analyzes cross-lingual memorization
Incorporates language similarity for memorization patterns
Links memorization to cross-lingual transferability empirically
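The paper does not spell out the exact form of its graph-based metric here, but the core idea, comparing a language's memorization against its higher-resource neighbors in a similarity graph, can be sketched as follows. All language codes, similarity weights, token counts, and memorization scores below are invented toy values, and `similarity_weighted_gap` is a hypothetical helper, not the authors' metric.

```python
# Toy sketch of a graph-based cross-lingual memorization analysis.
# Edge weights stand in for a combined genealogical/typological
# similarity (assumed form); all numbers are illustrative, not the
# paper's data.
similarity = {
    ("es", "pt"): 0.90,
    ("es", "gl"): 0.85,
    ("de", "nl"): 0.80,
    ("de", "lb"): 0.75,
}

train_tokens = {"es": 5e9, "pt": 2e9, "gl": 5e7,
                "de": 4e9, "nl": 1e9, "lb": 2e7}
memorization = {"es": 0.30, "pt": 0.28, "gl": 0.41,
                "de": 0.27, "nl": 0.25, "lb": 0.38}

def neighbors(lang):
    """Yield (neighbor, weight) pairs from the undirected similarity graph."""
    for (a, b), w in similarity.items():
        if a == lang:
            yield b, w
        elif b == lang:
            yield a, w

def similarity_weighted_gap(lang):
    """Average similarity-weighted excess memorization of `lang`
    relative to its higher-resource neighbors."""
    gaps = [
        w * (memorization[lang] - memorization[n])
        for n, w in neighbors(lang)
        if train_tokens[n] > train_tokens[lang]
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0

# With these toy numbers, the low-resource languages gl and lb
# memorize more than their high-resource kin (positive gap),
# mirroring the trend the paper reports.
for lang in ("gl", "lb"):
    print(lang, similarity_weighted_gap(lang))
```

The trend only becomes visible once the graph structure is used: a per-language regression on token counts alone would mix unrelated languages together, which is exactly the failure mode the paper attributes to prior work.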