Precise In-Parameter Concept Erasure in Large Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of precisely erasing undesirable concept-level knowledge (such as sensitive or copyright-protected content) acquired during pretraining of large language models (LLMs), this paper introduces PISCES, the first concept-level suppression framework operating directly in parameter space via interpretable feature directions. PISCES combines a disentangler model that decomposes MLP vectors into interpretable features, automated interpretability analysis that identifies features associated with a target concept, and targeted parameter editing that removes them, enabling fine-grained, concept-specific erasure distinct from fact-level or coarse-grained removal. Evaluations on Gemma 2 and Llama 3.1 show that PISCES reduces accuracy on the target concept to as low as 7.7%, roughly matching leading erasure methods on efficacy while improving erasure specificity by up to 31% and robustness by up to 38%.

📝 Abstract
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
Problem

Research questions and friction points this paper is trying to address.

Removing undesirable knowledge from large language models
Precise erasure of entire concepts in parameter space
Improving specificity and robustness of concept removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Directly edits concept-encoding parameter directions
Uses disentangler model for interpretable feature decomposition
Automated interpretability identifies target concept features
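The editing step described above (removing concept-encoding directions from model parameters) can be sketched as a projection: each concept-associated feature direction is projected out of the MLP weight vectors so the edited weights no longer write along that direction. The sketch below is a minimal NumPy illustration under assumed shapes, not the paper's implementation; the function name `erase_concept_features`, the `strength` parameter, and the matrix layout are all hypothetical.

```python
import numpy as np

def erase_concept_features(W_mlp, feature_dirs, concept_ids, strength=1.0):
    """Project concept-associated feature directions out of MLP weights.

    W_mlp:        (d_model, d_mlp) weight matrix; each column is treated
                  as a vector in the model's residual-stream space.
    feature_dirs: (n_features, d_model) feature directions, e.g. from a
                  disentangler model (hypothetical source here).
    concept_ids:  indices of features flagged as encoding the target concept.
    strength:     1.0 removes the component fully; smaller values dampen it.
    """
    W = W_mlp.copy()
    for i in concept_ids:
        d = feature_dirs[i]
        d = d / np.linalg.norm(d)          # unit-norm concept direction
        # Subtract each column's component along d (rank-1 update).
        W -= strength * np.outer(d, d @ W)
    return W
```

With `strength=1.0` and a single direction, the edited weights are exactly orthogonal to that direction, so the MLP can no longer write along it; removing several non-orthogonal directions sequentially only approximately zeroes the earlier ones.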