Align-then-Unlearn: Embedding Alignment for LLM Unlearning

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) can retain sensitive information from their training data, posing privacy and copyright risks. Existing token-level unlearning methods lack robustness, especially against prompt-rephrasing attacks, and fail to achieve concept-level forgetting. Method: a selective knowledge-removal framework that operates in semantic embedding space, introducing a two-stage "align-then-unlearn" paradigm: (1) alignment training of a module that predicts future-context embeddings; (2) unlearning via cosine-similarity-guided fine-tuning that decouples the predicted embeddings from the target concept. The method enables parameter-efficient adaptation. Results: the approach achieves a target-knowledge forgetting rate above 92% across multiple benchmarks with less than 1.5% overall performance degradation, significantly outperforming prior token-level methods and jointly achieving high forgetting efficacy and strong robustness at the conceptual level.

📝 Abstract
As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at https://github.com/ExplainableML/align-then-unlearn.
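The two objectives described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the function names, the linear prediction head, and the exact per-sample cosine form are assumptions.

```python
import numpy as np

def alignment_loss(pred_head_W, hidden_state, future_embedding, eps=1e-8):
    """Stage 1 (align): a prediction head (here, a simple linear map) turns
    the current hidden state into a guess at the upcoming context's
    embedding; training minimizes this loss so the head anticipates
    future context representations."""
    pred = pred_head_W @ hidden_state
    cos = np.dot(pred, future_embedding) / (
        np.linalg.norm(pred) * np.linalg.norm(future_embedding) + eps)
    return 1.0 - cos  # zero when prediction and actual future context align

def unlearning_loss(predicted_embeddings, target_embedding, eps=1e-8):
    """Stage 2 (unlearn): fine-tuning minimizes the average cosine
    similarity between predicted embeddings and the embedding of the
    concept to be removed, pushing the model away from that concept."""
    sims = [
        np.dot(p, target_embedding) / (
            np.linalg.norm(p) * np.linalg.norm(target_embedding) + eps)
        for p in predicted_embeddings
    ]
    return float(np.mean(sims))
```

Because the unlearning signal is defined on embeddings of whole concepts rather than specific token sequences, rephrased prompts that evoke the same concept should still trigger the forgetting objective.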
Problem

Research questions and friction points this paper is trying to address.

Remove sensitive data from LLMs to address privacy concerns
Overcome limitations of token-level unlearning methods
Achieve robust conceptual forgetting in semantic embedding space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Performs unlearning in semantic embedding space rather than on output tokens
Augments the LLM with a future-context embedding prediction module
Fine-tunes to minimize similarity between predicted embeddings and the target concept embedding