🤖 AI Summary
This work addresses a key limitation of existing large language model unlearning methods, which often merely suppress output behaviors without truly erasing internal knowledge, leaving residual capabilities intact. To overcome this, the authors propose the Knowledge Immunization Framework (KIF), which explicitly distinguishes between behavioral suppression and genuine knowledge erasure. KIF achieves authentic forgetting at the representational level by identifying and eliminating the internal activation-signature patterns associated with target knowledge. Integrating dynamic representation suppression with parameter-efficient fine-tuning, KIF attains near-ideal knowledge removal (FQ ≈ 0.99) across 3B–14B parameter models—including Llama, Mistral, Qwen, and DeepSeek—while preserving high model utility (MU = 0.62) and limiting post-unlearning utility drift to under 3%. The study further introduces a dual-metric evaluation protocol that pairs surface-level leakage with latent activation-trace persistence, reconciling the trade-off between unlearning efficacy and model stability.
📝 Abstract
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure trade-off that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
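The abstract does not include implementation details, but the core idea of targeting internal activation signatures rather than surface outputs can be illustrated with a minimal sketch. Assuming (hypothetically) that a signature is estimated as the unit mean-difference direction between hidden activations on target-knowledge prompts and on control prompts, suppression can then be realized by projecting that direction out of the hidden states. The helper names `signature_direction` and `suppress` are illustrative, not the paper's API.

```python
import numpy as np

def signature_direction(acts_target: np.ndarray, acts_control: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm activation-signature direction as the
    mean difference between target and control hidden activations.
    Both inputs have shape (n_samples, hidden_dim)."""
    d = acts_target.mean(axis=0) - acts_control.mean(axis=0)
    return d / np.linalg.norm(d)

def suppress(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the signature component from hidden states of shape
    (n_tokens, hidden_dim) by projecting out the given unit direction."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy demonstration: after suppression, hidden states carry no
# component along the estimated signature direction.
rng = np.random.default_rng(0)
target = rng.normal(1.0, 1.0, size=(64, 16))   # stand-in "target knowledge" activations
control = rng.normal(0.0, 1.0, size=(64, 16))  # stand-in control activations
d = signature_direction(target, control)
cleaned = suppress(rng.normal(size=(8, 16)), d)
```

In a full pipeline this projection would run inside forward hooks on selected layers while parameter-efficient adapters are fine-tuned, but this sketch only captures the geometric suppression step under the stated assumptions.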