🤖 AI Summary
This work addresses privacy and copyright risks in large language model (LLM) training by proposing a black-box dataset membership inference method that detects whether a target LLM was trained on a specific victim dataset solely from its text outputs. The method constructs two sets of local reference models—trained with and without the victim data—and infers membership by comparing semantic similarity, response distribution divergence, and statistical significance between the target model's outputs and those of the reference models. Unlike conventional gray-box approaches requiring access to internal model features (e.g., hidden-layer activations), the method operates purely via input–output queries, eliminating the need for model internals and substantially improving practical deployability. Experiments on real-world LLMs demonstrate high inference accuracy across diverse configurations and strong robustness against evasion strategies such as prompt engineering and output perturbation.
📝 Abstract
Today, the training of large language models (LLMs) can involve personally identifiable information and copyrighted material, raising the risk of dataset misuse. To mitigate this problem, this paper explores \textit{dataset inference}, which aims to detect whether a suspect model $\mathcal{M}$ used a victim dataset $\mathcal{D}$ in training. Previous research tackles dataset inference by aggregating results of membership inference attacks (MIAs) -- methods that determine whether individual samples are part of the training dataset. However, restricted by the low accuracy of MIAs, previous research mandates grey-box access to $\mathcal{M}$ to obtain intermediate outputs (probabilities, loss, perplexity, etc.) for satisfactory results. This reduces practicality, as LLMs, especially those deployed for profit, have limited incentives to return intermediate outputs.
In this paper, we propose a new method of dataset inference with only black-box access to the target model (i.e., assuming only the text-based responses of the target model are available). Our method is enabled by two sets of locally built reference models, one set involving $\mathcal{D}$ in training and the other not. By measuring which set of reference models $\mathcal{M}$ is closer to, we determine whether $\mathcal{M}$ used $\mathcal{D}$ for training. Evaluations of real-world LLMs in the wild show that our method offers high accuracy in all settings and presents robustness against bypassing attempts.
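The decision rule described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it uses Jaccard token overlap as a stand-in for the semantic-similarity measure, and the function names (`infer_membership`, `mean_similarity`) are invented for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two text responses (0 = disjoint, 1 = identical).
    A simple proxy for semantic similarity, used here for illustration only."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mean_similarity(target_responses, reference_responses):
    """Average per-query similarity between the target model's responses and
    one reference set's responses to the same prompts."""
    scores = [jaccard(t, r) for t, r in zip(target_responses, reference_responses)]
    return sum(scores) / len(scores)

def infer_membership(target, refs_with_d, refs_without_d) -> bool:
    """Black-box decision rule: declare that the target was trained on the
    victim dataset D if its outputs sit closer to the references trained
    with D than to those trained without it."""
    return mean_similarity(target, refs_with_d) > mean_similarity(target, refs_without_d)
```

In practice, each list holds the models' text responses to a shared set of probe prompts; the paper additionally uses distributional divergence and a statistical significance test rather than a single similarity threshold.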