Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

📅 2026-01-15
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing large language model unlearning methods, which often merely suppress output behaviors without truly erasing internal knowledge, leading to residual capabilities. To overcome this, the authors propose the Knowledge Immunization Framework (KIF), which explicitly distinguishes between behavioral suppression and genuine knowledge erasure. KIF achieves authentic forgetting at the representational level by identifying and eliminating internal activation signature patterns associated with target knowledge. Integrating dynamic representation suppression with parameter-efficient fine-tuning, KIF attains near-ideal knowledge removal (FQ ≈ 0.99) across 3B–14B parameter models—including Llama, Mistral, Qwen, and DeepSeek—while preserving high model utility (MU = 0.62) and limiting post-unlearning utility drift to under 3%. The study further introduces a dual evaluation protocol based on activation signatures, effectively reconciling the trade-off between unlearning efficacy and model stability.
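The summary above describes identifying internal activation-signature patterns associated with target knowledge and suppressing them. As a rough illustration only (this is not the paper's implementation; the function names, the mean-difference signature, and the projection step are all assumptions), one common way to sketch such a signature is the normalized difference of mean hidden activations between forget-set and retain-set prompts, with suppression as projecting that direction out of a hidden state:

```python
import numpy as np

def activation_signature(forget_acts, retain_acts):
    """Unit direction separating forget-set from retain-set activations
    (illustrative mean-difference signature, not the paper's method)."""
    direction = forget_acts.mean(axis=0) - retain_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def suppress(hidden, signature, alpha=1.0):
    """Remove the component of a hidden state along the signature
    direction; alpha=1.0 projects it out entirely."""
    return hidden - alpha * np.dot(hidden, signature) * signature

# Toy activations: 8 forget prompts, 8 retain prompts, hidden size 16,
# with the forget set shifted along dimension 0.
rng = np.random.default_rng(0)
retain = rng.normal(size=(8, 16))
forget = rng.normal(size=(8, 16)) + 2.0 * np.eye(16)[0]

sig = activation_signature(forget, retain)
h_clean = suppress(forget[0], sig)
print(abs(np.dot(h_clean, sig)))  # ≈ 0: signature component removed
```

In practice such a suppression would be applied inside the forward pass and combined with parameter-efficient fine-tuning (e.g. low-rank adapters) rather than on cached activations as here.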

📝 Abstract
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. We address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
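The abstract's dual-metric protocol pairs a surface-level leakage check with a latent trace measurement. A minimal sketch of what such a pair of metrics could look like (the function names, the substring-based leakage check, and the cosine-similarity trace are illustrative assumptions, not the paper's actual FQ/MU definitions):

```python
import numpy as np

def surface_leakage(outputs, target_terms):
    """Fraction of model outputs that still mention any target term
    (surface-level leakage; a deliberately simple stand-in metric)."""
    hits = sum(any(t.lower() in o.lower() for t in target_terms)
               for o in outputs)
    return hits / len(outputs)

def latent_trace(acts, signature):
    """Mean absolute cosine similarity between hidden activations and a
    knowledge-signature direction (latent trace persistence)."""
    sig = signature / np.linalg.norm(signature)
    norms = np.linalg.norm(acts, axis=1)
    return float(np.mean(np.abs(acts @ sig) / norms))

# A refusing model can show zero surface leakage...
outputs = ["I cannot answer that.", "I don't know."]
print(surface_leakage(outputs, ["Marie Curie"]))  # 0.0

# ...while its activations still align strongly with the signature,
# which is exactly the obfuscation-vs-erasure gap the protocol targets.
acts = np.array([[1.0, 0.0], [0.8, 0.6]])
print(round(latent_trace(acts, np.array([1.0, 0.0])), 2))  # 0.9
```

The point of the toy numbers is the mismatch: low surface leakage with a high latent trace would indicate obfuscation rather than genuine erasure.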
Problem

Research questions and friction points this paper is trying to address.

selective knowledge erasure
unlearning
activation signatures
model safety
GDPR compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

activation signatures
knowledge unlearning
representation-aware
parameter-efficient adaptation
latent trace persistence
Syed Naveed Mahmood
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Md. Rezaur Rahman Bhuiyan
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Tasfia Zaman
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Jareen Tasneem Khondaker
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Md. Sameer Sakib
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Nazia Tasnim
Boston University
PEFT & Model Editing · Computer Vision · Explainable AI · Multimodal Systems
Farig Sadeque
Associate Professor, BRAC University
Natural Language Processing · Computational Social Science