🤖 AI Summary
Language models (LMs) may inadvertently leak personally identifiable information (PII) from training data during inference; existing defenses such as differential privacy (DP) mitigate leakage but severely degrade model utility.
Method: We propose a precise intervention framework grounded in computational circuit analysis: (1) adapting causal circuit discovery to identify neural pathways critical for PII leakage within LMs, and (2) applying representation engineering and parameter-level editing—without retraining—to surgically patch these pathways.
Contribution/Results: Our approach preserves model performance while substantially suppressing PII leakage. Evaluated across multiple LMs, it reduces PII leakage recall by up to 65%. When combined with DP, residual leakage drops to 0.01%, outperforming state-of-the-art methods. This marks the first application of causal circuit analysis to PII leakage mitigation and demonstrates that targeted, post-hoc interventions can achieve strong privacy-utility trade-offs.
📝 Abstract
Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. We hypothesize that specific computational circuits within LMs are responsible for PII leakage, and confirm this through a comprehensive study that applies circuit discovery to localize them. Building on this finding, we propose PATCH (Privacy-Aware Targeted Circuit PatcHing), a novel approach that first identifies and then directly edits PII circuits to reduce leakage. PATCH achieves a better privacy-utility trade-off than existing defenses, e.g., reducing the recall of PII leakage from LMs by up to 65%. Furthermore, PATCH can be combined with DP to reduce the recall of residual leakage to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after existing defense mechanisms are applied; in contrast, PATCH effectively mitigates their impact.
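To make the core idea concrete, here is a minimal, purely illustrative sketch of "patching a circuit" on a toy two-layer ReLU network. Everything here is an assumption for illustration: the network, the weights, and the choice of hidden unit 1 as a stand-in for a discovered leakage pathway; the paper's actual PATCH method operates on real LM circuits found via causal circuit discovery and edits parameters or representations, not a toy ablation mask.

```python
# Illustrative sketch only (not the paper's implementation): ablating a
# hypothetical "leakage circuit" in a toy two-layer ReLU network by zeroing
# the activations of the identified hidden units at inference time.

def forward(x, W1, W2, ablated=frozenset()):
    """Toy two-layer network; `ablated` names hidden units to patch out."""
    # First layer with ReLU activation.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # "Patch": zero the units belonging to the (assumed) leakage circuit.
    hidden = [0.0 if i in ablated else h for i, h in enumerate(hidden)]
    # Second layer: scalar readout.
    return sum(w * h for w, h in zip(W2, hidden))

# Toy weights; hidden unit 1 plays the role of the PII pathway (assumption).
W1 = [[1.0, -1.0], [2.0, 0.5], [-0.5, 1.5]]
W2 = [0.3, 1.0, 0.2]
x = [1.0, 1.0]

baseline = forward(x, W1, W2)              # unpatched model output: 2.7
patched = forward(x, W1, W2, ablated={1})  # circuit patched out: 0.2
```

The design point this mirrors is that the intervention is local: only the units on the identified pathway are modified, so the rest of the computation (and hence model utility) is left intact.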