Can We Infer Confidential Properties of Training Data from LLMs?

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether fine-tuned large language models (LLMs) inadvertently leak sensitive dataset-level properties, such as patient demographics or disease prevalence, through property inference attacks. To this end, we introduce PropInfer, the first benchmark specifically designed for property inference against fine-tuned LLMs, covering both the question-answering and chat-completion fine-tuning paradigms. We propose two tailored attack methods: (i) a prompt-based generation attack that elicits property-revealing responses from the fine-tuned model, and (ii) a word-frequency-driven shadow-model attack suited to the statistical properties of LLM-generated text. Evaluation on the ChatDoctor dataset across multiple mainstream pre-trained LLMs demonstrates consistent leakage of sensitive dataset-level properties, providing concrete evidence of property leakage in fine-tuned LLMs. The core contributions are identifying a previously overlooked vulnerability of LLMs to property inference and providing a reproducible, paradigm-aware evaluation framework with principled attack methodologies.
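The prompt-based generation attack can be pictured as repeated sampling: query the fine-tuned model, then estimate a dataset-level property from how often generations mention each attribute value. The sketch below is a minimal illustration, not the paper's implementation; `generate` stands in for any text-generation API, and the prompt, regexes, and the gender-ratio property are all illustrative assumptions.

```python
# Hypothetical prompt-based property inference sketch: sample completions
# from a fine-tuned model and estimate the fraction of female patients in
# its fine-tuning data from attribute mentions in the generated text.
import re
from collections import Counter

def estimate_gender_ratio(generate, n_samples=200):
    """Estimate P(female) in the fine-tuning data from model generations.

    `generate` is any callable mapping a prompt string to a completion.
    """
    prompt = "Describe a typical patient case from your training data."
    counts = Counter()
    for _ in range(n_samples):
        text = generate(prompt).lower()
        if re.search(r"\b(she|her|female|woman)\b", text):
            counts["female"] += 1
        elif re.search(r"\b(he|his|male|man)\b", text):
            counts["male"] += 1
    total = counts["female"] + counts["male"]
    return counts["female"] / total if total else None
```

In practice the attacker would average over many prompts and calibrate the estimate against reference models, since generation frequencies only loosely track training-set proportions.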

📝 Abstract
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Assessing confidentiality risks in LLM training data
Evaluating property inference attacks on fine-tuned LLMs
Identifying vulnerabilities in domain-specific LLM applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

PropInfer benchmark for LLM property inference
Prompt-based and shadow-model attack methods
Evaluates question-answering and chat-completion paradigms