Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 26
Influential: 1
🤖 AI Summary
Existing LLM unlearning evaluations rely solely on behavioral testing, overlooking residual knowledge at the parameter level—enabling adversarial recovery of supposedly deleted information. This work proposes the first intrinsic unlearning evaluation framework grounded in parameter-level knowledge traces: it localizes “concept vectors” via vocabulary projection, constructs the open-source benchmark ConceptVectors, and systematically models knowledge traces in Llama-2 and Phi-3. We find that mainstream unlearning methods merely suppress—not erase—concept vectors; direct ablation of these vectors fully removes associated knowledge and drastically reduces adversarial recovery success rates. Our study establishes a new paradigm for parameter-level unlearning assessment, exposes fundamental limitations of behavioral evaluation, and advances unlearning research from black-box testing toward mechanistic interpretability. Code and the ConceptVectors benchmark are publicly released.

📝 Abstract
The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that encode concrete concepts - and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
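The vocabulary-projection idea in the abstract (reading a parameter vector through the unembedding matrix to see which tokens it promotes, in the style of the logit lens) can be sketched on a toy model. Everything below - the matrices, the vocabulary, and the `project_to_vocab` helper - is illustrative and not the paper's released code:

```python
import numpy as np

def project_to_vocab(param_vector, unembedding, vocab, top_k=3):
    """Project a parameter vector through the unembedding matrix and
    return the top-k vocabulary tokens it promotes (logit-lens style)."""
    logits = unembedding @ param_vector          # shape: (vocab_size,)
    top = np.argsort(logits)[::-1][:top_k]      # indices of largest logits
    return [vocab[i] for i in top]

# Toy setup: 4-token vocabulary, 3-dim hidden space (illustrative values).
vocab = ["paris", "france", "cat", "dog"]
W_U = np.array([[1.0, 0.0, 0.0],    # unembedding: one row per token
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [-1.0, -1.0, -1.0]])

# A hypothetical "concept vector" aligned with France-related tokens.
concept_vec = np.array([1.0, 0.8, 0.0])
print(project_to_vocab(concept_vec, W_U, vocab, top_k=2))
# -> ['paris', 'france']
```

In the paper's setting the parameter vectors inspected this way live inside the model's MLP layers; a vector whose projection concentrates on tokens related to one concrete concept is a candidate "concept vector".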
Problem

Research questions and friction points this paper is trying to address.

Evaluating unlearning methods beyond behavioral tests
Detecting residual knowledge in model parameters post-unlearning
Localizing concept vectors to assess knowledge removal effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametric knowledge traces for unlearning evaluation
Vocabulary projections to localize concept vectors
Direct ablation of vectors removes knowledge effectively
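The ablation finding above can be illustrated with a minimal numpy sketch: treating columns of an MLP down-projection as value vectors written to the residual stream, zeroing the identified column erases what it stores while leaving other columns intact. The matrices and the column-as-value-vector framing are simplifying assumptions for illustration, not the released ConceptVectors code:

```python
import numpy as np

# Toy MLP down-projection: each column is a "value vector" written to
# the residual stream when its neuron fires (assumed setup).
W_down = np.array([[2.0, 0.1],
                   [0.0, 0.3],
                   [1.0, 0.2]])          # (hidden=3, neurons=2)
W_U = np.array([[1.0, 0.0, 0.5],        # unembedding (vocab=2, hidden=3)
                [0.0, 1.0, 0.0]])
vocab = ["paris", "cat"]

def top_token(neuron_idx):
    """Token most promoted by a neuron's value vector under projection."""
    logits = W_U @ W_down[:, neuron_idx]
    return vocab[int(np.argmax(logits))]

print(top_token(0))                      # -> 'paris': neuron 0 encodes it

# Ablate the concept vector: zero the whole column. Unlike behavioral
# suppression, the stored knowledge is now gone from the parameters.
W_down[:, 0] = 0.0
print(W_U @ W_down[:, 0])               # all-zero logits: nothing promoted
print(top_token(1))                      # -> 'cat': other knowledge intact
```

This mirrors the paper's contrast: gradient-based unlearning methods leave such columns largely unchanged (suppression at inference time), whereas direct ablation removes the trace itself, which is what blocks adversarial recovery.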