🤖 AI Summary
This work identifies a critical risk in knowledge erasure for multilingual large language models (LLMs): when a model that has been fully fine-tuned on parallel multilingual corpora is subsequently unlearned using only English data, it exhibits language confusion—responding in a language different from that of the input prompt—which severely distorts conventional reference-based metrics (e.g., BLEU, ROUGE). To address this, the authors propose N-Mix, a novel n-gram–based metric that quantifies the degree of language mixing, enabling the first measurable characterization of confusion severity. Through semantic consistency analysis and cross-lingual baseline comparison, they demonstrate that standard reference-based evaluation yields pervasive false negatives. Empirical validation across diverse models and languages confirms the phenomenon's generality. The study advocates a shift toward semantics-oriented forgetting evaluation and establishes both conceptual foundations and practical standards for safe multilingual knowledge erasure.
📝 Abstract
Several studies have shown that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain largely performance-oriented. In this paper, we shift the perspective to evaluation and address an additional blind spot that reveals itself when a multilingual LLM is fully fine-tuned on a parallel multilingual dataset before unlearning. In this setting, language confusion occurs, whereby the model responds in a language different from that of the input prompt. Language confusion is problematic for unlearning because it causes standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) we introduce the N-gram-based Language-Mix (N-Mix) score to show quantitatively that language confusion is pervasive and consistent in multilingual LLMs; (2) we demonstrate that reference-based metrics produce false negatives when the N-Mix score is high; and (3) we argue for a new type of unlearning evaluation that directly assesses the content of generated sentences. We call this type of metric a semantic-based metric.
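The abstract names the N-Mix score but does not reproduce its formula. As a rough intuition for what an n-gram-based language-mix score can look like, here is a minimal, illustrative sketch that flags token n-grams whose tokens span more than one Unicode script; the function names (`n_mix`, `token_script`) and the script-based language proxy are assumptions for illustration, not the paper's actual definition:

```python
import unicodedata

def char_script(ch):
    """Map a character to a coarse script label via its Unicode name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "HANGUL", "CJK", "CYRILLIC",
                   "ARABIC", "HIRAGANA", "KATAKANA"):
        if script in name:
            return script
    return "OTHER"

def token_script(token):
    """Majority script among a token's alphabetic characters."""
    scripts = [s for s in (char_script(c) for c in token) if s]
    return max(set(scripts), key=scripts.count) if scripts else None

def n_mix(text, n=2):
    """Toy language-mix score: fraction of token n-grams whose tokens
    span more than one script. NOT the paper's exact N-Mix metric,
    which is not given in this abstract."""
    tokens = [t for t in text.split() if token_script(t)]
    if len(tokens) < n:
        return 0.0
    grams = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    mixed = sum(1 for g in grams
                if len({token_script(t) for t in g}) > 1)
    return mixed / len(grams)
```

For example, a purely English sentence scores 0.0, while a response that drifts between English and Korean mid-sentence yields a positive score; a script-based proxy like this cannot, of course, separate languages that share a script, which is one reason a dedicated metric is needed.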