Measuring the Depth of LLM Unlearning via Activation Patching

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a generalizable, fine-grained metric for checking whether target knowledge has actually been erased from the internal representations of a large language model. The authors propose the Unlearning Depth Score (UDS), a white-box unlearning metric that requires neither auxiliary training nor dataset-specific adaptation. UDS uses activation patching against a retain-model baseline to locate the layers that encode the target knowledge, then scores how much of that knowledge the unlearned model has erased on a normalized [0, 1] scale. In a meta-evaluation covering 20 metrics and 150 unlearned models across eight unlearning methods, UDS achieves the highest faithfulness and robustness. Case studies also show that white-box metrics can disagree at the layer level and that erasure depth varies across individual examples.

📝 Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
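
To make the two-step idea in the abstract concrete (locate knowledge-bearing layers with a retain model, then score erasure on a 0-1 scale), here is a minimal sketch in Python. It is not the authors' implementation: the function name `unlearning_depth_score`, the `threshold` parameter, and the weighted-ratio aggregation are all assumptions chosen for illustration. The inputs are assumed to be per-layer drops in the target answer's log-probability, obtained by patching one layer's activations from the retain model (or the unlearned model) into a model that still holds the knowledge. The linked repository is the reference for the actual definition.

```python
from typing import Optional, Sequence


def unlearning_depth_score(
    retain_drop: Sequence[float],
    unlearned_drop: Sequence[float],
    threshold: float = 0.05,
) -> Optional[float]:
    """Toy erasure score in [0, 1] (hypothetical aggregation, not the paper's formula).

    retain_drop[l]:    drop in target log-prob when layer l is patched with
                       activations from the retain model (never saw the data).
    unlearned_drop[l]: the same measurement using the unlearned model's activations.
    threshold:         assumed cutoff for "this layer encodes the target knowledge".
    """
    if len(retain_drop) != len(unlearned_drop):
        raise ValueError("both inputs need one value per layer")

    weighted_erasure = 0.0
    total_weight = 0.0
    for r, u in zip(retain_drop, unlearned_drop):
        if r <= threshold:
            # Patching in the retain model barely hurts, so this layer is
            # treated as not encoding the target knowledge and is skipped.
            continue
        # Fraction of this layer's knowledge that looks erased, clipped to [0, 1].
        erased = min(max(u / r, 0.0), 1.0)
        weighted_erasure += r * erased
        total_weight += r

    # None signals that no layer passed the threshold, so the score is undefined.
    return weighted_erasure / total_weight if total_weight > 0 else None


# Example: layer 1 looks fully erased, layer 2 looks untouched.
score = unlearning_depth_score([0.0, 0.4, 0.6], [0.0, 0.4, 0.0])
```

In this toy setup a score near 1 means the unlearned model's activations behave like the retain model's at every knowledge-bearing layer, while a score near 0 means they still carry the target knowledge. Weighting each layer by its retain-model drop is one possible design choice; an unweighted mean over the selected layers would be another.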

Problem

Research questions and friction points this paper is trying to address.

LLM unlearning

knowledge erasure

activation patching

unlearning evaluation

internal representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unlearning Depth Score

activation patching

white-box evaluation