🤖 AI Summary
Existing LLM privacy evaluation frameworks overemphasize PII detection while neglecting socio-contextual factors and legal compliance requirements. Method: We propose PrivaCI-Bench—the first comprehensive privacy benchmark grounded in Contextual Integrity (CI) theory—integrating statutory provisions, real-world judicial precedents, privacy policies, and official toolkits to generate synthetic data and formalize CI into a computable framework. Our methodology combines legal text parsing, contextual modeling, privacy flow annotation, and controlled synthetic data generation. Contribution/Results: PrivaCI-Bench enables multi-dimensional assessment of LLMs’ privacy reasoning and regulatory compliance decision-making capabilities. Empirical evaluation reveals that state-of-the-art models—including QwQ-32B and DeepSeek-R1—exhibit limited capacity in cross-contextual privacy inference and legally grounded compliance judgments, underscoring both the benchmark’s validity and the critical need for context-aware privacy evaluation.
📝 Abstract
Recent advancements in generative large language models (LLMs) have enabled wider applicability, accessibility, and flexibility. However, their reliability and trustworthiness remain in doubt, especially regarding individuals' data privacy. Considerable effort has gone into privacy evaluation, with benchmarks probing LLMs' privacy awareness and robustness from their generated outputs to their hidden representations. Unfortunately, most of these works adopt a narrow formulation of privacy and investigate only personally identifiable information (PII). In this paper, we follow the merit of the Contextual Integrity (CI) theory, which posits that privacy evaluation should cover not only the transmitted attributes but also the whole relevant social context through private information flows. We present PrivaCI-Bench, a comprehensive contextual privacy evaluation benchmark targeted at legal compliance, covering well-annotated privacy and safety regulations, real court cases, privacy policies, and synthetic data built from the official toolkit, to study LLMs' privacy and safety compliance. We evaluate the latest LLMs, including the recent reasoning models QwQ-32B and DeepSeek-R1. Our experimental results suggest that although LLMs can effectively capture key CI parameters within a given context, they still require further advancement for privacy compliance.
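To make the CI framing concrete, the sketch below models an information flow by its five canonical CI parameters (sender, recipient, subject, attribute, transmission principle) and checks it against a norm table. This is a minimal illustrative sketch, not the paper's implementation: the roles, the `ALLOWED_NORMS` table, and the `complies` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CIFlow:
    """A Contextual Integrity information flow: who sends what
    attribute about whom to whom, under which transmission principle."""
    sender: str
    recipient: str
    subject: str
    attribute: str
    transmission_principle: str

# Hypothetical norm table: a flow complies only if its (sender, recipient,
# attribute, transmission principle) tuple matches an allowed pattern.
ALLOWED_NORMS = {
    ("physician", "insurer", "diagnosis", "with_consent"),
}

def complies(flow: CIFlow) -> bool:
    """Return True if the flow matches a permitted norm pattern."""
    key = (flow.sender, flow.recipient, flow.attribute,
           flow.transmission_principle)
    return key in ALLOWED_NORMS

# A consented disclosure to an insurer is permitted...
ok = complies(CIFlow("physician", "insurer", "patient",
                     "diagnosis", "with_consent"))
# ...while the same attribute flowing to an advertiser without consent
# violates the contextual norm.
bad = complies(CIFlow("physician", "advertiser", "patient",
                      "diagnosis", "without_consent"))
```

The point of the sketch is that compliance depends on the whole flow tuple, not on the attribute alone: the same "diagnosis" attribute is acceptable or not depending on recipient and transmission principle, which is exactly the contextual dimension PII-only benchmarks miss.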