PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address privacy risks arising from large language models (LLMs) memorizing and leaking personally identifiable information (PII), where existing differential-privacy or neuron-level interventions often incur significant utility loss or act at too coarse a granularity to be effective, this paper proposes an explainability-driven feature-intervention framework for PII mitigation: (1) PII-enriched layers are localized via feature probing; (2) a k-sparse autoencoder (k-SAE) disentangles and isolates unambiguous, PII-specific representations; and (3) fine-grained feature ablation and vector-guided steering are applied. Evaluated on Gemma2-2b and Llama2-7b, the method reduces the email leakage rate from 5.15% to 0% while preserving over 99.4% of model utility, substantially outperforming conventional neuron-level approaches. The work shows that intervening on sparse, monosemantic features beats manipulating polysemantic neurons, and sheds light on how PII is structured and stored inside LLMs.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but also pose significant privacy risks by memorizing and leaking Personally Identifiable Information (PII). Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage. To address this challenge, we introduce PrivacyScalpel, a novel privacy-preserving framework that leverages LLM interpretability techniques to identify and mitigate PII leakage while maintaining performance. PrivacyScalpel comprises three key steps: (1) Feature Probing, which identifies layers in the model that encode PII-rich representations, (2) Sparse Autoencoding, where a k-Sparse Autoencoder (k-SAE) disentangles and isolates privacy-sensitive features, and (3) Feature-Level Interventions, which employ targeted ablation and vector steering to suppress PII leakage. Our empirical evaluation on Gemma2-2b and Llama2-7b, fine-tuned on the Enron dataset, shows that PrivacyScalpel significantly reduces email leakage from 5.15% to as low as 0.0%, while maintaining over 99.4% of the original model's utility. Notably, our method outperforms neuron-level interventions in privacy-utility trade-offs, demonstrating that acting on sparse, monosemantic features is more effective than manipulating polysemantic neurons. Beyond improving LLM privacy, our approach offers insights into the mechanisms underlying PII memorization, contributing to the broader field of model interpretability and secure AI deployment.
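The k-Sparse Autoencoder at the heart of step (2) can be sketched in a few lines: encode a hidden activation, keep only the k largest pre-activations, and reconstruct from that sparse code. This is a minimal NumPy illustration of the k-SAE mechanism the abstract describes; the function names, dimensions, and random weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ksae_encode(x, W_enc, b_enc, k):
    """k-sparse encoding: keep the k largest pre-activations, zero the rest."""
    pre = W_enc @ x + b_enc                 # latent pre-activations
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]              # indices of the k largest entries
    z[top] = np.maximum(pre[top], 0.0)      # ReLU on the surviving units
    return z

def ksae_decode(z, W_dec, b_dec):
    """Reconstruct the original activation from the sparse latent code."""
    return W_dec @ z + b_dec

# toy example: 8-dim activation, 32 latent features, k = 4
rng = np.random.default_rng(0)
d, m, k = 8, 32, 4
W_enc, b_enc = rng.normal(size=(m, d)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(d, m)), np.zeros(d)

x = rng.normal(size=d)
z = ksae_encode(x, W_enc, b_enc, k)
x_hat = ksae_decode(z, W_dec, b_dec)
```

Because at most k latent units are nonzero, each surviving feature tends to be monosemantic, which is what makes the downstream PII-specific interventions possible.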
Problem

Research questions and friction points this paper is trying to address.

LLMs memorize training data and can leak Personally Identifiable Information (PII) at inference time.
Existing defenses such as differential privacy and neuron-level interventions degrade model utility or fail to prevent leakage.
Neuron-level edits act on polysemantic units, making them too coarse-grained for targeted PII removal.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature Probing identifies PII-rich model layers
Sparse Autoencoding isolates privacy-sensitive features
Feature-Level Interventions suppress PII leakage effectively
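The two intervention primitives from step (3) can be sketched directly: ablation zeroes the SAE latents flagged as PII-specific, and vector steering removes a hidden state's component along a PII direction. A minimal NumPy sketch follows; the function names, toy indices, and random vectors are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def ablate_features(z, pii_idx):
    """Feature-level ablation: zero the SAE latents identified as PII-specific."""
    z = z.copy()
    z[list(pii_idx)] = 0.0
    return z

def steer_away(h, v, alpha=1.0):
    """Vector steering: remove a fraction alpha of h's component along a
    (unit-normalised) PII direction v."""
    v = v / np.linalg.norm(v)
    return h - alpha * np.dot(h, v) * v

rng = np.random.default_rng(1)

# ablate three hypothetical PII-specific latents
z = rng.random(32)
z_clean = ablate_features(z, pii_idx=[3, 7, 19])

# fully steer a hidden state away from a hypothetical PII direction
h = rng.normal(size=16)
v = rng.normal(size=16)
h_clean = steer_away(h, v, alpha=1.0)
```

With alpha = 1.0 the steered activation is exactly orthogonal to the PII direction; smaller alpha values trade weaker suppression for less disturbance to the representation, which is the privacy-utility knob the paper's evaluation turns.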
Ahmed Frikha
Cerebras Systems Inc.
Generative ML · Domain Generalization · Continual Learning · Multimodal Learning · Privacy-Preserving ML
Muhammad Reza Ar Razi
Huawei Munich Research Center
K. K. Nakka
Huawei Munich Research Center
Ricardo Mendes
Huawei Technologies Düsseldorf GmbH
Privacy-Preserving AI · Location Privacy · Ubiquitous Computing
Xue Jiang
Huawei Munich Research Center
Xuebing Zhou
Huawei Munich Research Center