The Model's Language Matters: A Comparative Privacy Analysis of LLMs

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how linguistic structure influences privacy leakage in multilingual large language models (LLMs), focusing on models trained on medical corpora in English, Spanish, French, and Italian. We employ three privacy attack paradigms (extraction attacks, counterfactual memorization, and membership inference) alongside six quantitative linguistic metrics (e.g., redundancy, tokenization granularity, morphological complexity) to systematically characterize the relationship between language properties and privacy vulnerability. Our analysis reveals that higher linguistic redundancy and coarser tokenization granularity correlate with increased privacy leakage: English models exhibit the strongest membership distinguishability, morphologically richer French and Spanish models demonstrate greater privacy robustness, and Italian models suffer the most severe overall leakage. This work establishes the first linguistically grounded, interpretable framework for assessing and comparing privacy risks across multilingual LLMs.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Although LLM privacy has mostly been evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.
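The membership-inference vector mentioned in the abstract can be illustrated with a minimal, self-contained sketch: a loss-threshold attack in which lower per-example loss suggests training-set membership, with "membership separability" summarized as a pairwise-ranking AUC. The loss values below are synthetic stand-ins, not the paper's models or data.

```python
# Hedged sketch: loss-threshold membership inference on synthetic losses.
# Members (seen in training) tend to have lower loss than non-members;
# the numbers here are illustrative stand-ins, not the paper's results.
import random

random.seed(0)
member_losses = [random.gauss(2.0, 0.5) for _ in range(1000)]
nonmember_losses = [random.gauss(3.0, 0.5) for _ in range(1000)]

def auc(members, nonmembers):
    """AUC of the rule 'predict member if loss is lower': the fraction
    of (member, nonmember) pairs ranked correctly."""
    wins = sum(m < n for m in members for n in nonmembers)
    return wins / (len(members) * len(nonmembers))

print(f"membership AUC: {auc(member_losses, nonmember_losses):.3f}")
```

With well-separated loss distributions, the AUC approaches 1.0; an AUC near 0.5 would indicate the attacker cannot distinguish members from non-members.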
Problem

Research questions and friction points this paper is trying to address.

Analyzing how language structure affects privacy leakage in multilingual LLMs
Quantifying privacy vulnerability through linguistic redundancy and tokenization granularity
Demonstrating language-specific privacy risks requiring tailored protection mechanisms
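Two of the indicators named above, redundancy and tokenization granularity, can be approximated with simple proxies. The paper's exact metric definitions are not given in this summary, so the functions below are illustrative stand-ins: redundancy via a zlib compression ratio, and granularity via mean characters per whitespace token.

```python
# Hedged proxies for two linguistic indicators; these are illustrative
# approximations, not the paper's exact metric definitions.
import zlib

def redundancy(text: str) -> float:
    """Compression-based redundancy: higher value = more compressible,
    i.e. more repetitive/redundant text."""
    raw = text.encode("utf-8")
    return 1.0 - len(zlib.compress(raw)) / len(raw)

def granularity(text: str) -> float:
    """Mean characters per whitespace-delimited token; a larger value
    stands in for coarser tokenization."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens)

sample = "the patient reports pain the patient reports fever"
print(redundancy(sample), granularity(sample))
```

Under the paper's finding, languages scoring higher on both proxies would be expected to show greater leakage under the three attack vectors.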
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantify linguistic indicators affecting privacy leakage
Evaluate three attack vectors across multiple languages
Propose language-aware privacy-preserving mechanisms for LLMs
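As a toy illustration of the extraction attack vector listed above, the sketch below replaces a real LLM with a literal corpus lookup that simulates perfect memorization: the attacker supplies a training prefix and checks whether the "model" emits the sensitive suffix verbatim. TRAINING_CORPUS, toy_model_complete, and the example records are all hypothetical, not from the paper.

```python
# Hedged toy sketch of a verbatim extraction attack. The "model" is a
# stand-in lookup over hypothetical training lines, simulating a model
# that has perfectly memorized its corpus.
TRAINING_CORPUS = [
    "patient id 12345 diagnosed with hypertension",
    "patient reports mild headache and fatigue",
]

def toy_model_complete(prefix: str) -> str:
    """Greedy completion: return the remainder of the first training
    line that starts with the prefix (perfect memorization)."""
    for line in TRAINING_CORPUS:
        if line.startswith(prefix):
            return line[len(prefix):]
    return ""

def extraction_success(prefix: str, secret_suffix: str) -> bool:
    """The attack succeeds if the completion matches the secret exactly."""
    return toy_model_complete(prefix) == secret_suffix

print(extraction_success("patient id 12345 ", "diagnosed with hypertension"))
```

In the real setting, success rates of prompts like these, measured per language, are what make extraction a comparable attack vector across the four corpora.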