Blackbox Dataset Inference for LLM

📅 2025-07-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses privacy and copyright risks in large language model (LLM) training by proposing a black-box dataset membership inference method that detects whether a target LLM was trained on a specific victim dataset solely from its text outputs. The method constructs two local reference models—trained with and without the victim data—and infers membership by comparing semantic similarity, response distribution divergence, and statistical significance between the target model’s outputs and those of the reference models. Unlike conventional gray-box approaches requiring access to internal model features (e.g., hidden-layer activations), the method operates purely via input–output queries, eliminating the need for model internals and substantially improving practical deployability. Experiments on real-world LLMs demonstrate high inference accuracy across diverse configurations and strong robustness against evasion strategies such as prompt engineering and output perturbation.
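The core decision rule described above can be sketched in a few lines. This is a minimal stand-in, not the paper's exact procedure: in practice the per-query similarity scores would come from comparing the target model's responses against each reference set's responses with an embedding model, and the decision would also weigh distribution divergence and a significance test. The function name, margin parameter, and score values below are illustrative assumptions.

```python
import statistics

def infer_membership(sim_with_member_refs, sim_with_nonmember_refs, margin=0.0):
    """Decide whether the target model is closer to reference models
    trained WITH the victim dataset than to those trained WITHOUT it.

    Each argument is a list of per-query similarity scores between the
    target model's text responses and one reference set's responses
    (higher = more similar). Returns True if the target looks closer
    to the with-dataset references by more than `margin`.
    """
    mean_member = statistics.fmean(sim_with_member_refs)
    mean_nonmember = statistics.fmean(sim_with_nonmember_refs)
    return mean_member - mean_nonmember > margin

# Hypothetical scores for illustration: the target's outputs resemble
# the with-dataset references far more than the without-dataset ones.
used = infer_membership([0.82, 0.79, 0.88, 0.85], [0.61, 0.58, 0.66, 0.60])
```

Because only text responses are compared, this check needs nothing beyond ordinary query access to the target model, which is what makes the setting fully black-box.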

📝 Abstract
Today, the training of large language models (LLMs) can involve personally identifiable information and copyrighted material, incurring dataset misuse. To mitigate the problem of dataset misuse, this paper explores \textit{dataset inference}, which aims to detect if a suspect model $\mathcal{M}$ used a victim dataset $\mathcal{D}$ in training. Previous research tackles dataset inference by aggregating results of membership inference attacks (MIAs) -- methods to determine whether individual samples are a part of the training dataset. However, restricted by the low accuracy of MIAs, previous research mandates grey-box access to $\mathcal{M}$ to get intermediate outputs (probabilities, loss, perplexity, etc.) for obtaining satisfactory results. This leads to reduced practicality, as LLMs, especially those deployed for profit, have limited incentives to return the intermediate outputs. In this paper, we propose a new method of dataset inference with only black-box access to the target model (i.e., assuming only the text-based responses of the target model are available). Our method is enabled by two sets of locally built reference models, one set involving $\mathcal{D}$ in training and the other not. By measuring which set of reference models $\mathcal{M}$ is closer to, we determine if $\mathcal{M}$ used $\mathcal{D}$ for training. Evaluations of real-world LLMs in the wild show that our method offers high accuracy in all settings and presents robustness against bypassing attempts.
Problem

Research questions and friction points this paper is trying to address.

Detect if a model used a specific dataset in training
Improve dataset inference with only black-box access
Enhance accuracy and robustness against bypassing attempts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Blackbox dataset inference method
Uses locally built reference models
Measures model proximity to references
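The proximity measurement listed above includes a statistical-significance component. A simple way to sketch it is Welch's t statistic over the two groups of per-query similarity scores: a large positive value indicates the target's outputs are significantly closer to the with-dataset references. This assumes higher scores mean closer outputs; the paper's actual test may differ.

```python
import math
import statistics

def welch_t(scores_member_refs, scores_nonmember_refs):
    """Welch's t statistic for two independent samples of similarity
    scores. Positive and large => the target model is significantly
    closer to the references trained with the victim dataset."""
    ma = statistics.fmean(scores_member_refs)
    mb = statistics.fmean(scores_nonmember_refs)
    va = statistics.variance(scores_member_refs)   # sample variance
    vb = statistics.variance(scores_nonmember_refs)
    se = math.sqrt(va / len(scores_member_refs) + vb / len(scores_nonmember_refs))
    return (ma - mb) / se
```

Comparing the statistic to a threshold (or converting it to a p-value with the t distribution) turns the raw proximity gap into a calibrated membership decision rather than a bare mean comparison.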
Ruikai Zhou
University of Utah

Kang Yang
University of Utah

Xun Chen
Samsung Research of America

Wendy Hui Wang
Stevens Institute of Technology
Security, privacy, robustness, and fairness of machine learning

Guanhong Tao
Assistant Professor, University of Utah
Machine Learning, Computer Security

Jun Xu
University of Utah