🤖 AI Summary
This work addresses the limitation of existing unlearning methods for large language models (LLMs), which primarily suppress target information at the output layer but fail to disentangle forgotten and retained knowledge in the representation space. To overcome this, we propose CLReg, a contrastive representation regularization approach that, for the first time, introduces contrastive learning into LLM unlearning. CLReg explicitly separates forgotten and retained features in the latent space, achieving representational disentanglement. Theoretical analysis reveals a direct link between such disentanglement and improved unlearning efficacy, surpassing the constraints of conventional methods that operate solely in the prediction space. Experiments demonstrate that CLReg significantly reduces feature entanglement across multiple benchmarks, consistently enhances the performance of mainstream unlearning techniques, and does so without introducing additional privacy risks.
📝 Abstract
Most LLM unlearning methods aim to approximate retrain-from-scratch behavior with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing the generation of forgotten content, such approaches may act merely as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features and pushes them away from retain features, explicitly reducing forget-retain interference while minimally shifting retain features. We provide the first theoretical insights relating representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which boosts mainstream unlearning methods without posing extra privacy risks, motivating future work that reshapes the representation space to remove forget concepts.
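The core idea, pushing forget features away from retain features while keeping retain features close to their pre-unlearning values, can be sketched as a contrastive-style regularizer. This is a hypothetical minimal version: the abstract does not specify CLReg's exact objective, so the cosine-similarity hinge for separation, the reference-anchoring term, and the `margin`/`alpha` hyperparameters below are illustrative assumptions, not the authors' loss.

```python
import numpy as np

def clreg_loss(forget_feats, retain_feats, ref_retain_feats,
               margin=0.5, alpha=1.0):
    """Sketch of a contrastive representation regularizer.

    forget_feats:     (Nf, d) hidden features on forget-set inputs
    retain_feats:     (Nr, d) hidden features on retain-set inputs
    ref_retain_feats: (Nr, d) retain features from the reference
                      (pre-unlearning) model

    Separation term: penalize cosine similarity between forget and
    retain features above `margin` (disentanglement pressure).
    Anchor term: keep retain features close to the reference model
    (minimal shift on retained knowledge).
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    f = l2norm(np.asarray(forget_feats, dtype=np.float64))
    r = l2norm(np.asarray(retain_feats, dtype=np.float64))
    ref = l2norm(np.asarray(ref_retain_feats, dtype=np.float64))

    # Pairwise cosine similarities between forget and retain features;
    # hinge keeps only similarities exceeding the margin.
    sim = f @ r.T
    separation = np.maximum(sim - margin, 0.0).mean()

    # Cosine distance of retain features to their reference values.
    anchor = (1.0 - (r * ref).sum(axis=-1)).mean()

    return separation + alpha * anchor
```

In practice such a term would be added to an existing unlearning objective (e.g., a gradient-ascent or preference-style loss), with features taken from an intermediate transformer layer; both choices are left open by the abstract.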