PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies an intrinsic vulnerability in large language models’ (LLMs) refusal mechanisms for privacy-sensitive queries—e.g., public figures’ sexual orientation—that can be systematically bypassed. We propose a lightweight activation steering method: first, linear probing identifies attention heads strongly correlated with refusal behavior; then, guided by a privacy evaluator’s labels, we pinpoint critical neurons and selectively modulate their internal activations. The approach requires no fine-tuning or external prompting. Evaluated on four mainstream LLMs, it achieves ≥95% jailbreak success rates, with over 50% of responses leaking verifiable sensitive information. Crucially, this work provides the first empirical evidence that private knowledge implicitly encoded in LLMs’ internal representations can be stably extracted via minimal, interpretable neural interventions. These findings advance our understanding of the failure boundaries of alignment mechanisms and inform the design of robust privacy-preserving safeguards.

📝 Abstract
This paper investigates privacy jailbreaking in LLMs via activation steering, focusing on whether manipulating internal activations can bypass LLM alignment and alter response behavior on privacy-related queries (e.g., a public figure's sexual orientation). We begin by identifying attention heads predictive of refusal behavior for private attributes (e.g., sexual orientation) using lightweight linear probes trained on labels from a privacy evaluator. Next, guided by the trained probes, we steer the activations of a small subset of these attention heads to induce the model to generate non-refusal responses. Our experiments show that these steered responses often disclose the targeted sensitive attribute, along with other private information about data subjects such as life events, relationships, and personal histories that the models would typically refuse to produce. Evaluations across four LLMs reveal jailbreaking disclosure rates of at least 95%, with more than 50% of these responses, on average, revealing true personal information. Our controlled study demonstrates that private information memorized in LLMs can be extracted through targeted manipulation of internal activations.
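The probe-and-steer pipeline described in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the probe here is a simple difference-of-means linear direction, the dimensions and steering strength are made up, and in the actual method probing and steering operate on attention-head outputs inside the LLM during generation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 200  # hypothetical head-activation dimension and sample count

# Synthetic stand-ins for attention-head activations on privacy queries:
# "refusal" and "non-refusal" responses separated along a hidden direction.
hidden = rng.normal(size=dim)
hidden /= np.linalg.norm(hidden)
X_refuse = rng.normal(scale=0.5, size=(n, dim)) + hidden
X_comply = rng.normal(scale=0.5, size=(n, dim)) - hidden

# Linear probe: difference-of-means direction separating the two classes,
# thresholded at the midpoint of the projected class means.
w = X_refuse.mean(axis=0) - X_comply.mean(axis=0)
w /= np.linalg.norm(w)
threshold = 0.5 * ((X_refuse @ w).mean() + (X_comply @ w).mean())

def is_refusal(acts):
    """Probe prediction: does this activation look like a refusal?"""
    return acts @ w > threshold

accuracy = 0.5 * (is_refusal(X_refuse).mean() + (~is_refusal(X_comply)).mean())

# Steering: shift refusal activations along -w until the probe flips,
# analogous to modulating the selected heads at inference time.
alpha = 3.0  # steering strength (hand-picked for this toy setup)
X_steered = X_refuse - alpha * w
flip_rate = (~is_refusal(X_steered)).mean()
print(f"probe accuracy: {accuracy:.2f}, steered flip rate: {flip_rate:.2f}")
```

In the real attack the probe is trained on activations collected from the model itself, and the steering shift is applied to the chosen heads' outputs during decoding rather than to a static matrix.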
Problem

Research questions and friction points this paper is trying to address.

Investigates privacy jailbreaking in LLMs via activation steering
Examines whether manipulating activations can bypass LLM alignment on privacy queries
Demonstrates extraction of private info by targeted internal activation manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manipulate activations to bypass LLM alignment
Use linear probes to identify refusal behavior
Steer attention heads to disclose private information
Krishna Kanth Nakka
EPFL
LLM Privacy, AI Safety, ML Robustness, ML Interpretability
Xue Jiang
Huawei Munich Research Center, Bavaria, Germany
Xuebing Zhou
Huawei Munich Research Center, Bavaria, Germany