🤖 AI Summary
This study addresses the severe privacy risks posed by LLM-driven automated deep user profiling, highlighting a critical gap between user privacy concerns and platform practices: platforms inadequately address public privacy expectations at both the technical and policy levels. The authors propose PrivacyIceberg, a three-tiered framework that categorizes privacy risks as explicitly searched, contextually inferred, or deeply aggregated, according to the sophistication of LLM exploitation. They develop IcebergExplorer, an auditing tool that reconstructs high-fidelity user profiles with over 90% factual accuracy within ten minutes and at a cost under three dollars, using only minimal personally identifiable information (PII) as seed input. Combining empirical auditing with an investigation of user awareness and platform practices, the study identifies six root causes of privacy leakage and proposes multi-stakeholder countermeasures for LLM vendors, individuals, and data publishers.
📝 Abstract
Large Language Models (LLMs) have revolutionized how information is collected, aggregated, and reasoned over. However, this capability enables a novel and accessible vector of privacy intrusion: automated, in-depth personal profiling, which engenders a chilling effect of "peepers everywhere". Existing research primarily focuses on the LLM training pipeline, emphasizing the exposure of Personally Identifiable Information (PII) through memorization, while human-centric perspectives on privacy remain underexplored. To fill this void, we empirically investigate real-world privacy perception through the lens of human awareness and the practices of LLM-integrated platforms, revealing a significant dissonance: platforms fail to address public privacy concerns at either the technical or the policy level. To facilitate a systematic and quantifiable study of privacy risk, we propose the PrivacyIceberg, which categorizes real-world human privacy risks into three tiers (explicitly searched, contextually inferred, and deeply aggregated) based on the sophistication of LLM exploitation. We develop IcebergExplorer to audit privacy exposure: using minimal PII as a search seed, it reconstructs high-fidelity profiles, achieving over 90% factual accuracy within 10 minutes at a cost under $3 in real-world scenarios. Additionally, we identify six root causes contributing to such privacy disclosures and propose multi-stakeholder countermeasures for LLM vendors, individuals, and data publishers.