🤖 AI Summary
Existing large language models (LLMs) retain unsafe knowledge that is difficult to eliminate completely; parameter fine-tuning merely suppresses harmful activations without erasing their informational traces, leading to persistent forgetting challenges and vulnerability to relearning attacks. To address this, we propose an irreversible projection transformation mechanism operating in the hidden-layer representation space—introducing structured erasure to machine unlearning for the first time. Specifically, orthogonal projection of hidden states at designated network layers enables permanent removal—not mere suppression—of harmful knowledge. Our method preserves pre-erasure model performance, supports sequential multi-round unlearning, and significantly enhances robustness against relearning attacks. Extensive evaluations across multiple benchmarks demonstrate state-of-the-art performance in forgetting efficacy, task retention rate, and resistance to relearning.
📝 Abstract
While Large Language Models (LLMs) have demonstrated impressive performance across various domains and tasks, concerns about their safety are becoming increasingly severe. In particular, since models may store unsafe knowledge internally, machine unlearning has emerged as a representative paradigm for ensuring model safety. Existing approaches employ various training techniques, such as gradient ascent and negative preference optimization, in an attempt to eliminate the influence of undesired data on target models. However, these methods merely suppress the activation of undesired data through parametric training without completely eradicating its informational traces within the model. This fundamental limitation makes effective continuous unlearning difficult to achieve and renders these methods vulnerable to relearning attacks. To overcome these challenges, we propose a Metamorphosis Representation Projection (MRP) approach that pioneers the application of irreversible projection properties to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks, achieving state-of-the-art unlearning effectiveness while preserving natural performance. Our code is available at https://github.com/ChengcanWu/MRP.
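The core idea, projecting hidden states onto the orthogonal complement of an unwanted subspace so the removed component cannot be recovered, can be sketched numerically. This is a minimal illustration of the projection property only, not the paper's actual MRP construction: the subspace basis `U`, the dimensions, and the random hidden state are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the irreversible-projection idea: erase the component
# of a hidden state that lies in a "harmful" subspace by projecting onto
# the subspace's orthogonal complement. U, d, k, and h are hypothetical.

rng = np.random.default_rng(0)
d = 8  # hidden-state dimensionality (illustrative)
k = 2  # rank of the subspace to erase (illustrative)

# Orthonormal basis U (d x k) spanning the directions to remove.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))

# Projection onto the orthogonal complement of span(U).
P = np.eye(d) - U @ U.T

h = rng.normal(size=d)  # one hidden-state vector
h_clean = P @ h         # state after erasure

# P has rank d - k, so it is singular: the erased component is gone for
# good, and re-applying P changes nothing (idempotence).
print(np.allclose(U.T @ h_clean, 0.0))    # no residue in the erased subspace
print(np.allclose(P @ h_clean, h_clean))  # projection is idempotent
```

Because `P` is rank-deficient it has no inverse, which is what distinguishes this structural erasure from parametric suppression: no further fine-tuning of the projected representation can restore the removed component.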