Discovering Universal Activation Directions for PII Leakage in Language Models

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unclear mechanisms by which current language models internally represent and regulate the leakage of personally identifiable information (PII). We propose UniLeak, a framework that reveals, for the first time, that PII leakage can be uniformly characterized by a single universal latent direction within the residual stream. Linearly steering activations along this direction at inference time, in an unsupervised manner, consistently amplifies PII generation probabilities across diverse contexts without requiring any training data or ground-truth PII labels. By integrating mechanistic-interpretability analysis, self-generated-text-guided direction search, and label-free PII detection, UniLeak substantially increases PII leakage rates across multiple models and datasets, outperforming existing prompt-engineering approaches while largely preserving output quality.
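To make the steering operation concrete, the sketch below adds a fixed vector to the residual stream of one transformer block during generation via a PyTorch forward hook. This is an illustrative reconstruction, not the authors' code: the model (`gpt2`), injection layer (`LAYER`), steering strength (`ALPHA`), and the randomly initialized `direction` are all placeholder assumptions standing in for the discovered universal direction.

```python
# Minimal sketch of residual-stream steering; NOT the paper's implementation.
# `gpt2`, LAYER, ALPHA, and the random `direction` are hypothetical stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # hypothetical injection layer
ALPHA = 4.0  # hypothetical steering strength
direction = torch.randn(model.config.hidden_size)  # stand-in for the learned direction
direction = direction / direction.norm()           # steer along a unit vector

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual-stream
    # hidden state of shape (batch, seq, hidden); shift it along `direction`.
    hidden_states = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("Contact information: ", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

In this toy form the random vector would not elicit PII; the point is only the mechanics of linear activation addition at inference time.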

📝 Abstract
Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
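As a rough illustration of how a steering direction might be recovered from self-generated text without labels, the snippet below computes a difference-in-means vector between generated-style texts that a crude heuristic flags as PII-like and those it does not. The regex detector, the layer choice, and the difference-of-means estimator are hedged stand-ins; the abstract does not specify the paper's actual direction-search procedure.

```python
# Hedged sketch of label-free direction recovery via difference-in-means.
# The regex "detector", LAYER, and the sample texts are illustrative only.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6  # hypothetical layer at which activations are read out

# Crude label-free proxy for PII (email- and phone-like strings); a stand-in
# for whatever label-free detector the framework actually employs.
PII_RE = re.compile(r"[\w.]+@[\w.]+|\+?\d[\d\- ]{7,}\d")

def mean_activation(texts):
    """Average last-token residual-stream activation at LAYER over texts."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        acts.append(hs[LAYER][0, -1])  # last-token activation
    return torch.stack(acts).mean(dim=0)

# In practice these would be the model's own generations; short literals
# keep the sketch self-contained.
samples = [
    "Reach me at jane.doe@example.com for details.",
    "The weather today is sunny with a light breeze.",
    "Call 555-123-4567 to confirm the appointment.",
    "The committee will meet again next quarter.",
]
pii = [s for s in samples if PII_RE.search(s)]
clean = [s for s in samples if not PII_RE.search(s)]

direction = mean_activation(pii) - mean_activation(clean)
direction = direction / direction.norm()  # unit steering direction
print(direction.shape)
```

The resulting unit vector could then be injected at inference time as in the steering sketch above; a real pipeline would use far more samples and validate the direction's effect on PII generation probability.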
Problem

Research questions and friction points this paper is trying to address.

PII leakage
language models
privacy
activation directions
mechanistic interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

universal activation directions
PII leakage
mechanistic interpretability
residual stream
language model privacy