Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

📅 2026-05-14

🤖 AI Summary
Existing machine unlearning methods often fail once a model is quantized, so the knowledge they remove is neither persistently nor structurally erased. This work proposes MANSU, which combines causal circuit attribution with null-space projection to erase target knowledge while preserving retained knowledge, and which keeps unlearning stable under 4-bit quantization. The authors report that MANSU is the first method to satisfy four desiderata simultaneously: meaningful forgetting, preservation of retained knowledge, no quantization-induced rebound, and structural erasure. To distinguish behavioral suppression from genuine structural removal, they introduce Circuit Attribution Divergence (CAD). Experiments across multiple model families and hazard benchmarks show MANSU meeting all four criteria, whereas gradient-based baselines lose their forgetting under compression, recovering up to 0.05 in accuracy after quantization.
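To make the null-space idea concrete, here is a minimal sketch of projecting a weight update so that it is orthogonal to a set of retain-set input directions. It is an illustration only, not MANSU's implementation: the function names, the toy vectors, and the use of plain Gram-Schmidt are all assumptions, and the circuit restriction and Fisher bound mentioned in the paper are not modeled.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_to_null_space(update, retain_dirs, eps=1e-12):
    """Remove from `update` every component lying in span(retain_dirs).

    Hypothetical helper: for one row of a linear layer, an update that is
    orthogonal to the retain inputs leaves the layer's output on those
    inputs unchanged.
    """
    # Build an orthonormal basis of the retain directions (Gram-Schmidt).
    basis = []
    for r in retain_dirs:
        v = list(r)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([vi / norm for vi in v])

    # Subtract the component of the update along each basis vector.
    out = list(update)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy data (assumed): two retain inputs and one raw update for a weight row.
retain_inputs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
raw_update = [3.0, 4.0, 5.0]
projected = project_to_null_space(raw_update, retain_inputs)
print(projected)  # [0.0, 0.0, 5.0]
```

In this toy case only the third coordinate survives, because the two retain inputs span the first two coordinates. The projected update changes nothing the layer computes on either retain input.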
📝 Abstract
Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
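The bin-width argument can be illustrated with a toy quantizer. The sketch below uses a hypothetical uniform 16-level grid as a stand-in for NF4 (whose real levels are non-uniform), and the `apply_floor` helper is only one plausible reading of a "per-parameter magnitude floor", not the paper's procedure. It shows that an update far smaller than the bin width vanishes after rounding, while a floored update does not.

```python
def quantize(w, levels):
    """Round w to the nearest value in the quantization grid."""
    return min(levels, key=lambda q: abs(q - w))


# Assumption: uniform 4-bit grid on [-1, 1]; real NF4 levels are non-uniform.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]
BIN_WIDTH = LEVELS[1] - LEVELS[0]


def survives_quantization(w, delta, levels=LEVELS):
    """True if the update changes the quantized weight, False if it is erased."""
    return quantize(w + delta, levels) != quantize(w, levels)


def apply_floor(delta, floor):
    """Hypothetical magnitude floor: push small nonzero updates up to `floor`."""
    if delta == 0:
        return 0.0
    if abs(delta) < floor:
        return floor if delta > 0 else -floor
    return delta


w = LEVELS[8]            # a weight sitting on an interior grid point
small = BIN_WIDTH / 47   # update 47x below the bin width
large = 0.6 * BIN_WIDTH  # update that crosses the rounding boundary

print(survives_quantization(w, small))                          # False
print(survives_quantization(w, large))                          # True
print(survives_quantization(w, apply_floor(small, BIN_WIDTH)))  # True
```

The first case mirrors the failure described above: the full-precision weight moves, but the quantized weight is identical, so the unlearning update is discarded at deployment. Concentrating the same update budget on fewer parameters is what lets each one clear the boundary.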
Problem

Research questions and friction points this paper is trying to address.

machine unlearning
quantization
forgetting permanence
sparsity-permanence tradeoff
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

machine unlearning
quantization
circuit attribution
null-space projection
structural erasure