Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the tendency of large language models to violate contextual privacy norms in high-stakes scenarios by inadvertently disclosing sensitive information that humans would typically handle with discretion. Grounded in contextual integrity theory, this work is the first to demonstrate that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded in the model's activation space as linearly separable and functionally independent directions. Leveraging this insight, the authors propose a CI-parametric steering method that intervenes along each dimension independently. Through probing analyses, linear-separability validation, and targeted behavioral interventions, they show that while models internally encode structured privacy representations, their outputs often misalign with normative expectations. The proposed approach outperforms monolithic steering strategies in both efficacy and predictability when mitigating privacy violations.
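To make the probing idea concrete, here is a minimal sketch of how one might test whether a single CI parameter is linearly separable in a model's activations. The model name (`gpt2`), probe layer, and toy labeled prompts are illustrative assumptions, not the paper's actual setup; the standard evidence for linear separability is a linear classifier reaching high held-out accuracy on such activations.

```python
# Minimal linear-probe sketch for one CI parameter in LLM activations.
# Assumptions (not from the paper): model, layer, and toy prompts are placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper probes multiple LLMs
LAYER = 6            # hypothetical probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Return the hidden state of the final token at LAYER."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]

# Toy labels for one CI parameter (information type):
# 1 = sensitive, 0 = non-sensitive. Real probing would use a curated dataset.
texts = [
    "The clinic shared the patient's diagnosis with her insurer.",
    "The weather forecast predicts rain on Saturday.",
    "Her therapist mentioned her medication history to a coworker.",
    "The museum opens at nine in the morning.",
]
labels = [1, 0, 1, 0]

X = torch.stack([last_token_activation(t) for t in texts]).numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

# A linear probe: high held-out accuracy suggests the CI parameter is
# linearly separable in activation space.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```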
📝 Abstract
Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, pointing toward ways to improve contextual privacy understanding in LLMs.
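The CI-parametric steering described in the abstract can likewise be sketched as per-dimension activation addition. The sketch below assumes one learned unit direction per CI parameter (e.g., from a probe weight vector or a difference of mean activations) and intervenes at a single hypothetical layer via a forward hook; the model, layer, directions, and coefficients are placeholders rather than the paper's configuration. Setting a coefficient to zero leaves that CI dimension untouched, which is the point of per-parameter rather than monolithic control.

```python
# Sketch of per-CI-dimension activation steering with a forward hook.
# Assumptions (not from the paper): model, layer, random stand-in directions,
# and steering strengths are all illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder
LAYER = 6            # hypothetical intervention layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

hidden = model.config.hidden_size
# Stand-ins for learned CI directions; in practice these would come from
# probes or mean activation differences, one per norm-determining parameter.
directions = {
    "information_type": torch.randn(hidden),
    "recipient": torch.randn(hidden),
    "transmission_principle": torch.randn(hidden),
}
directions = {k: v / v.norm() for k, v in directions.items()}

# Independent steering strengths: each CI dimension is adjusted on its own
# rather than through a single monolithic vector.
alphas = {"information_type": 4.0, "recipient": 0.0, "transmission_principle": 2.0}

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden_states = output[0]
    for name, direction in directions.items():
        hidden_states = hidden_states + alphas[name] * direction
    return (hidden_states,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
prompt = "Should I tell my coworker about a patient's diagnosis?"
ids = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **ids, max_new_tokens=40, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```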
Problem

Research questions and friction points this paper is trying to address.

contextual privacy
large language models
privacy violations
contextual integrity
latent representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual integrity
privacy probing
activation space
parametric steering
large language models