🤖 AI Summary
This study investigates implicit privacy bias in large language model (LLM) training data—systematic deviations in models’ judgments of information flow appropriateness across social contexts. We propose the first theoretically grounded evaluation framework for privacy bias, built upon Contextual Integrity theory. Our method employs cross-model response comparison, prompt robustness control, and statistical bias detection to isolate privacy-specific biases while mitigating confounding effects from prompt variation. Experiments across mainstream LLMs reveal significant and inconsistent privacy judgment biases, indicating a fundamental lack of systematic modeling of privacy norms in training data. Key contributions include: (1) the first theory-driven privacy bias assessment paradigm; (2) empirical evidence that improved model capability and optimization may exacerbate—not alleviate—privacy judgment inaccuracies; and (3) an interpretable, empirically validated evaluation tool to advance trustworthy AI governance.
📝 Abstract
As LLMs are integrated into sociotechnical systems, it is crucial to examine the privacy biases they exhibit. A privacy bias refers to the skew in the appropriateness of information flows within a given context that LLMs acquire from large amounts of non-publicly available training data. This skew may either align with existing expectations or signal a symptom of systemic issues reflected in the training datasets. We formulate a novel research question: how can we examine privacy biases in the training data of LLMs? We present a novel approach to assess the privacy biases using a contextual integrity-based methodology to evaluate the responses from different LLMs. Our approach accounts for the sensitivity of responses across prompt variations, which hinders the evaluation of privacy biases. We investigate how privacy biases are affected by model capacities and optimizations.