🤖 AI Summary
This work addresses privacy and copyright risks in large language model (LLM) training by proposing a black-box dataset membership inference method that detects whether a target LLM was trained on a specific victim dataset solely from its text outputs. The method constructs two sets of local reference models—trained with and without the victim data—and infers membership by comparing semantic similarity, response distribution divergence, and statistical significance between the target model's outputs and those of the reference models. Unlike conventional gray-box approaches requiring access to internal model features (e.g., hidden-layer activations), the method operates purely via input–output queries, eliminating the need for model internals and substantially improving practical deployability. Experiments on real-world LLMs demonstrate high inference accuracy across diverse configurations and strong robustness against evasion strategies such as prompt engineering and output perturbation.
📝 Abstract
Today, the training of large language models (LLMs) can involve personally identifiable information and copyrighted material, raising the risk of dataset misuse. To mitigate this problem, this paper explores \textit{dataset inference}, which aims to detect whether a suspect model $\mathcal{M}$ used a victim dataset $\mathcal{D}$ in training. Previous research tackles dataset inference by aggregating results of membership inference attacks (MIAs) -- methods that determine whether individual samples are part of the training dataset. However, restricted by the low accuracy of MIAs, previous research mandates grey-box access to $\mathcal{M}$ to obtain intermediate outputs (probabilities, loss, perplexity, etc.) for satisfactory results. This reduces practicality, as LLMs, especially those deployed for profit, have limited incentives to return intermediate outputs.
In this paper, we propose a new method of dataset inference with only black-box access to the target model (i.e., assuming only the text-based responses of the target model are available). Our method is enabled by two sets of locally built reference models, one set involving $\mathcal{D}$ in training and the other not. By measuring which set of reference models $\mathcal{M}$ is closer to, we determine whether $\mathcal{M}$ used $\mathcal{D}$ for training. Evaluations of real-world LLMs in the wild show that our method offers high accuracy in all settings and presents robustness against bypassing attempts.
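The decision rule described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it uses Jaccard token overlap as a stand-in for the semantic-similarity measure, and the function names (`infer_membership`, `mean_similarity`) are invented for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two text responses (0 = disjoint, 1 = identical).
    A simple proxy for semantic similarity, used here for illustration only."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mean_similarity(target_responses, reference_responses):
    """Average per-query similarity between the target model's responses and
    one reference set's responses to the same prompts."""
    scores = [jaccard(t, r) for t, r in zip(target_responses, reference_responses)]
    return sum(scores) / len(scores)

def infer_membership(target, refs_with_d, refs_without_d) -> bool:
    """Black-box decision rule: declare that the target was trained on the
    victim dataset D if its outputs sit closer to the references trained
    with D than to those trained without it."""
    return mean_similarity(target, refs_with_d) > mean_similarity(target, refs_without_d)
```

In practice, each list holds the models' text responses to a shared set of probe prompts; the paper additionally uses distributional divergence and a statistical significance test rather than a single similarity threshold.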