🤖 AI Summary
To address security risks arising from sensitive knowledge that remains encoded in large language models (LLMs), this paper proposes FALCON, a fine-grained activation manipulation framework for precise, efficient, and controllable machine unlearning. Methodologically, it introduces a contrastive orthogonal unalignment mechanism that separates retained and forgotten representations in the latent space. It combines information-theoretic parameter selection, projection of conflicting gradients onto orthogonal subspaces, and representation-guided training-time unlearning, which addresses the precision limitations of conventional approaches built on coarse-grained loss combinations. Evaluated across multiple benchmarks, the method achieves an average 42% reduction in knowledge recovery rate while incurring less than 1.2% degradation in downstream task performance, improving robustness against recovery attacks. The work positions this approach as a new paradigm for secure and controllable model editing.
📝 Abstract
Large language models have been widely applied, but can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches, relying on coarse-grained loss combinations, have limitations in precisely separating knowledge and balancing removal effectiveness with model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflict gradients onto orthogonal subspaces to resolve conflicts between forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility, exhibiting robust resistance against knowledge recovery attempts.
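The gradient-projection idea mentioned above can be illustrated with a minimal sketch. It assumes a common conflict rule: when the forgetting gradient has a negative inner product with the retention gradient, the component along the retention gradient is removed. The function names `dot` and `project_out_conflict` are hypothetical, and this is not necessarily the exact procedure FALCON uses.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(g_forget: Vector, g_retain: Vector,
                         eps: float = 1e-12) -> Vector:
    """Return g_forget with its conflicting component removed.

    Assumption (illustrative, not taken from the paper): two gradients
    "conflict" when their inner product is negative. In that case the
    forgetting gradient is projected onto the subspace orthogonal to
    the retention gradient; otherwise it is returned unchanged.
    """
    inner = dot(g_forget, g_retain)
    if inner >= 0.0:
        # No conflict: the forgetting step does not oppose retention.
        return list(g_forget)
    # Coefficient of g_forget along g_retain (eps guards a zero gradient).
    scale = inner / (dot(g_retain, g_retain) + eps)
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]


# Conflicting pair: the forgetting gradient points against retention.
adjusted = project_out_conflict([1.0, -1.0], [0.0, 1.0])
# Non-conflicting pair: left unchanged.
unchanged = project_out_conflict([1.0, 1.0], [0.0, 1.0])
```

After projection, the adjusted forgetting gradient is (numerically) orthogonal to the retention gradient, so to first order a step along it does not increase the retention loss.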