🤖 AI Summary
Large language model (LLM) applications often lead users to disclose more personal information than necessary, creating significant privacy risks.
Method: We propose a novel data minimization paradigm that identifies the minimal set of sensitive information a user must disclose while preserving response utility. We formally define a data minimization operator framework, construct a privacy-ordered transformation space, and design a priority-queue-based tree search that automatically locates the optimal disclosure point.
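The priority-queue search described above can be sketched as a best-first search over masking subsets, where states that hide more sensitive spans are explored first and a state is kept only if the response still satisfies a utility check. The sketch below is illustrative, not the paper's implementation: the function name `best_first_minimize` and the `utility_ok` callback (which in practice would query the response model and a utility judge) are hypothetical placeholders.

```python
import heapq

def best_first_minimize(spans, utility_ok, *, max_expansions=1000):
    """Best-first search for the largest utility-preserving mask.

    spans:       list of sensitive spans in the prompt (indexed 0..n-1)
    utility_ok:  hypothetical callback; given a frozenset of masked span
                 indices, returns True if the response still has utility
    Returns the largest frozenset of maskable span indices found.
    """
    start = frozenset()          # nothing masked: the original prompt
    best = start
    # Priority = -number of masked spans, so more-private states pop first.
    heap = [(0, 0, start)]
    seen = {start}
    tiebreak = 0                 # keeps heap comparisons away from sets
    expansions = 0
    while heap and expansions < max_expansions:
        _, _, state = heapq.heappop(heap)
        expansions += 1
        if len(state) > len(best):
            best = state         # deepest utility-preserving mask so far
        for i in range(len(spans)):
            if i in state:
                continue
            child = state | {i}  # mask one additional span
            if child in seen:
                continue
            seen.add(child)
            if utility_ok(child):   # prune masks that destroy utility
                tiebreak += 1
                heapq.heappush(heap, (-len(child), tiebreak, child))
    return best
```

For example, with spans `["age", "zip code", "name", "diagnosis"]` and a utility check that only requires the diagnosis to stay visible, the search returns the mask `{0, 1, 2}`, i.e. everything except the span the task actually needs.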
Contribution/Results: Experiments reveal that state-of-the-art LLMs exhibit substantial tolerance to information masking—GPT-5 maintains response utility even when 85.7% of sensitive content is masked, markedly outperforming Qwen2.5-0.5B (19.3%). This indicates a positive correlation between model scale and data minimization capability. Our work bridges a critical gap in privacy-aware LLM evaluation and establishes a scalable methodology for privacy-enhanced human–AI interaction.
📝 Abstract
The rapid deployment of large language models (LLMs) in consumer applications has led to frequent exchanges of personal information. To obtain useful responses, users often share more than necessary, increasing privacy risks via memorization, context-based personalization, or security breaches. We present a framework to formally define and operationalize data minimization: for a given user prompt and response model, quantifying the least privacy-revealing disclosure that maintains utility, and we propose a priority-queue tree search to locate this optimal point within a privacy-ordered transformation space. We evaluated the framework on four datasets spanning open-ended conversations (ShareGPT, WildChat) and knowledge-intensive tasks with single-ground-truth answers (CaseHold, MedQA), quantifying achievable data minimization with nine LLMs as the response model. Our results demonstrate that larger frontier LLMs can tolerate stronger data minimization than smaller open-source models while maintaining task quality (85.7% redaction for GPT-5 vs. 19.3% for Qwen2.5-0.5B). By comparing against our search-derived benchmarks, we find that LLMs struggle to predict optimal data minimization directly, showing a bias toward abstraction that leads to oversharing. This suggests not just a privacy gap, but a capability gap: models may lack awareness of what information they actually need to solve a task.