Layer-Targeted Multilingual Knowledge Erasure in Large Language Models

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of cross-lingual knowledge retention in multilingual large language models, where information erased in one language remains accessible through others. The authors systematically investigate how the depth of intervention affects unlearning efficacy, revealing that shallow-layer modifications degrade multilingual capabilities while deep-layer approaches fail to fully eliminate target knowledge. To overcome this, they propose performing targeted unlearning in language-agnostic intermediate layers. By leveraging Centered Kernel Alignment (CKA) and the Linguistic Regions Development Score (LRDS), they identify layers that encode universal linguistic representations, further validating their approach via Logit Lens analysis. Evaluated across three mainstream model architectures and three unlearning algorithms, the method achieves robust cross-lingual knowledge removal while restricting updates to the identified layers and optimizing on only a small set of source languages, establishing intervention depth as a critical determinant of successful multilingual unlearning.
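The Logit Lens analysis mentioned above decodes a model's intermediate hidden states directly through its unembedding matrix, showing what the model "would predict" if computation stopped at a given layer. A minimal sketch of the idea with toy numpy tensors (the array shapes, the `ln_gamma` scale vector, and the omission of a LayerNorm bias are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def logit_lens(hidden_state, ln_gamma, W_U):
    """Toy Logit Lens: decode an intermediate residual-stream vector.

    hidden_state: (d_model,) hidden state at some intermediate layer
    ln_gamma:     (d_model,) final-LayerNorm scale (bias omitted for brevity)
    W_U:          (d_model, vocab_size) unembedding matrix
    Returns vocabulary logits as if the model stopped at this layer.
    """
    # Apply the model's final LayerNorm, then project into vocab space
    mu = hidden_state.mean()
    sigma = hidden_state.std()
    normed = (hidden_state - mu) / (sigma + 1e-5) * ln_gamma
    return normed @ W_U
```

Comparing the top-decoded tokens layer by layer is one way to check whether target knowledge is genuinely absent from intermediate representations, rather than merely suppressed at the output.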

📝 Abstract
Recent work has demonstrated that machine unlearning in Large Language Models (LLMs) fails to generalize across languages: knowledge erased in one language frequently remains accessible through others. However, the underlying cause of this failure and a principled solution remain open. In this work, we identify intervention depth as the key factor determining multilingual generalization. Through systematic layer-wise experiments, we characterize two distinct failure modes: shallow-layer interventions achieve erasure but collapse multilingual capabilities in held-out languages, while deep-layer interventions preserve utility but fail to erase target knowledge even in source languages. These findings reveal that the choice of intervention layer is not a free parameter; it fundamentally determines whether multilingual unlearning succeeds. We propose MUTE (Multilingual Unlearning via Targeted Erasure), a framework that uses Centered Kernel Alignment (CKA) and Linguistic Regions Development Score (LRDS) to identify intermediate, language-agnostic layers where cross-lingual representations converge. By restricting unlearning updates to these layers, MUTE achieves robust multilingual knowledge erasure while optimizing on only a small set of source languages. Extensive experiments across three LLM architectures and three unlearning algorithms validate our approach, with mechanistic analysis via Logit Lens probing confirming genuine knowledge removal rather than output-level suppression.
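The abstract's layer-selection step relies on Centered Kernel Alignment (CKA) to measure where representations of the same content in different languages converge. A minimal sketch of linear CKA over per-language activation matrices (the function name and the idea of comparing paired multilingual activations layer by layer are assumptions for illustration; the paper may use a kernelized or batched variant):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X: (n_samples, d1) activations for a layer, e.g. English prompts
    Y: (n_samples, d2) activations at the same layer for translations
    Returns a similarity in [0, 1]; values near 1 indicate the two sets
    of representations match up to rotation and isotropic scaling.
    """
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based linear-kernel form:
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator
```

Under the paper's framing, layers where cross-lingual CKA peaks would be the language-agnostic candidates for restricted unlearning updates.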
Problem

Research questions and friction points this paper is trying to address.

multilingual knowledge erasure
machine unlearning
large language models
cross-lingual generalization
knowledge removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual unlearning
intervention depth
language-agnostic representations
knowledge erasure
layer-wise analysis