Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models

📅 2024-10-03

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

Problem: Language models that are augmented with external context (e.g., in RAG systems) can leak private information from that context, yet conventional evaluations based on output similarity overestimate the risk because the model's parametric knowledge may already contain the same information. Method: We propose *context influence*, a metric grounded in differential privacy, to quantify how much the context alone contributes to model outputs, separating that contribution from parametric knowledge via controlled interventions. The methodology uses context subset ablation, decoding-level perturbation, and systematic experiments that vary model scale, context length, and generation position. Contribution/Results: We identify out-of-distribution context as the primary leakage source; smaller models, longer contexts, and earlier generation positions exhibit significantly higher leakage. The metric enables fine-grained attribution, giving practitioners an interpretable and quantifiable privacy risk assessment tool for RAG and related retrieval-augmented applications.

📝 Abstract

Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when answering queries, and how to estimate this privacy leakage is not well understood. A straightforward approach of directly comparing an LM's output to the contexts can overestimate the privacy risk, since the LM's parametric knowledge might already contain the augmented contextual knowledge. To address this, we introduce *context influence*, a metric that builds on differential privacy, a widely adopted privacy notion, to estimate the privacy leakage of contextual knowledge during decoding. Our approach measures how each subset of the context influences an LM's response while separating out the LM's own parametric knowledge. Using our context influence metric, we demonstrate that context privacy leakage occurs when contextual knowledge is out of distribution with respect to parametric knowledge. Moreover, we experimentally demonstrate how context influence properly attributes the privacy leakage to augmented contexts, and we evaluate how factors such as model size, context size, and generation position affect context privacy leakage. The practical implications of our results will inform practitioners of the privacy risk associated with augmented contextual knowledge.
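
To make the idea concrete, here is a minimal sketch of a differential-privacy-style influence score for a single decoding step. It is an illustrative assumption, not the paper's exact definition: the function name `context_influence` and the toy probabilities are hypothetical. The sketch compares the probability assigned to the emitted token with the full context against the probability with one context subset removed, and reports the absolute log ratio.

```python
import math


def context_influence(p_with: float, p_without: float) -> float:
    """Absolute log ratio of the emitted token's probability with vs.
    without a context subset (hypothetical, DP-inspired definition)."""
    if p_with <= 0.0 or p_without <= 0.0:
        raise ValueError("probabilities must be positive")
    return abs(math.log(p_with) - math.log(p_without))


# Toy numbers, invented for illustration only.
# In-distribution: the model already predicts the token without the context.
in_dist = context_influence(p_with=0.90, p_without=0.85)
# Out-of-distribution: the token is likely only when the context is present.
out_dist = context_influence(p_with=0.90, p_without=0.01)

print(f"in-distribution influence:     {in_dist:.3f}")
print(f"out-of-distribution influence: {out_dist:.3f}")
```

Under these assumed numbers the out-of-distribution case scores far higher, which matches the abstract's claim that leakage arises when the context is out of distribution with respect to parametric knowledge. A plain output-to-context similarity check would flag both cases equally, since the emitted token appears in the context either way.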

Problem

Research questions and friction points this paper is trying to address.

Estimating privacy leakage in language models' augmented contextual knowledge

Differentiating parametric and contextual knowledge influence on privacy risks

Measuring how context subsets affect privacy leakage during model decoding
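
The three questions above can be sketched together as a leave-one-out ablation over context documents during decoding. This is a hedged illustration under stated assumptions: `toy_prob` stands in for a real LM's next-token probability, `PARAMETRIC` is a made-up set of tokens the model is assumed to know without context, and taking the maximum over generation positions is one possible aggregation, not necessarily the one used in the paper.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical stand-in for parametric knowledge: tokens the toy model
# predicts confidently even with no context at all.
PARAMETRIC = {"Paris", "is", "the", "capital"}


def toy_prob(docs: Sequence[str], prefix: Sequence[str], token: str) -> float:
    """Toy next-token probability: high if the token is known
    parametrically or appears in any context document, low otherwise."""
    in_context = any(token in d.split() for d in docs)
    return 0.9 if (token in PARAMETRIC or in_context) else 0.05


def leave_one_out_influence(
    prob_fn: Callable[[Sequence[str], Sequence[str], str], float],
    docs: Sequence[str],
    response: Sequence[str],
) -> List[float]:
    """For each document, the largest absolute log ratio (over generation
    positions) between the full-context and ablated-context probability."""
    scores = [0.0] * len(docs)
    for t, token in enumerate(response):
        prefix = response[:t]
        p_full = prob_fn(docs, prefix, token)
        for i in range(len(docs)):
            ablated = [d for j, d in enumerate(docs) if j != i]
            p_ablated = prob_fn(ablated, prefix, token)
            scores[i] = max(scores[i], abs(math.log(p_full / p_ablated)))
    return scores


docs = ["Paris is the capital", "patient ID 7731"]
response = ["Paris", "7731"]
scores = leave_one_out_influence(toy_prob, docs, response)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc!r}")
```

In this toy setup the first document receives zero influence even though the response repeats its content, because the model would have produced "Paris" anyway. Only the second document, whose content is absent from the assumed parametric knowledge, is attributed leakage.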

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces context influence, a metric for the privacy leakage of contextual knowledge

Builds on the notion of differential privacy

Measures how each context subset influences the model's response
