Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) can memorize training data and inadvertently leak it, which poses significant privacy risks. This work proposes the PopQuiz Attack, a black-box membership inference attack built on a quiz mechanism: target samples are reformulated as multiple-choice questions, and the model's answers alone are used to infer whether a sample was part of the training set. The attack needs no white-box access. Across six widely used LLMs and four datasets it achieves an average ROC-AUC of 0.873, outperforming existing methods by 20.6%. Instruction-based, filter-based, and differential privacy-based defenses reduce the attack's performance but do not eliminate the risk, so current LLMs remain vulnerable to privacy leakage.

📝 Abstract

Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.
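The abstract describes the attack only at a high level, so the following is a minimal sketch of the general idea rather than the authors' procedure. It assumes a quiz is built by masking the last word of the target text and mixing it with distractors, that the membership score is the fraction of quizzes answered correctly, and that the black-box model is reachable through a hypothetical `ask_model(question, options)` callable returning the index of the chosen option. The names `make_quiz`, `membership_score`, `roc_auc`, and `make_toy_model` are all illustrative.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical black-box interface: takes a question and its options,
# returns the index of the option the model picks.
AskModel = Callable[[str, List[str]], int]


def make_quiz(sample: str, distractor_pool: Sequence[str], n_options: int = 4,
              rng: Optional[random.Random] = None) -> Tuple[str, List[str], int]:
    """Mask the last word of `sample` and wrap it in a multiple-choice question.

    Masking the final word is an assumption made for this sketch only.
    """
    rng = rng or random.Random(0)
    words = sample.split()
    answer, prefix = words[-1], " ".join(words[:-1])
    # Deduplicate the pool and make sure the true answer is not a distractor.
    candidates = [w for w in dict.fromkeys(distractor_pool) if w != answer]
    options = rng.sample(candidates, n_options - 1) + [answer]
    rng.shuffle(options)
    question = f"Complete the text: '{prefix} ____'"
    return question, options, options.index(answer)


def membership_score(sample: str, ask_model: AskModel,
                     distractor_pool: Sequence[str],
                     n_trials: int = 8, seed: int = 0) -> float:
    """Fraction of quizzes about `sample` that the model answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        question, options, answer_idx = make_quiz(sample, distractor_pool, rng=rng)
        if ask_model(question, options) == answer_idx:
            correct += 1
    return correct / n_trials


def roc_auc(member_scores: Sequence[float],
            nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            wins += 1.0 if m > n else 0.5 if m == n else 0.0
    return wins / (len(member_scores) * len(nonmember_scores))


def make_toy_model(memorized: Sequence[str]) -> AskModel:
    """Stand-in for an LLM: recalls memorized texts, otherwise always picks option 0."""
    def ask(question: str, options: List[str]) -> int:
        for text in memorized:
            words = text.split()
            if " ".join(words[:-1]) in question and words[-1] in options:
                return options.index(words[-1])
        return 0
    return ask
```

With the toy model, a memorized sample is answered correctly on every trial, while an unseen sample is right only when the correct option happens to land at index 0. In a real evaluation, `ask_model` would wrap an API call to the target LLM, and `roc_auc` would be computed over scores for known members and known non-members.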

Problem

Research questions and friction points this paper is trying to address.

membership inference attack

large language models

privacy

training data leakage

black-box attack

Innovation

Methods, ideas, or system contributions that make the work stand out.

membership inference attack

black-box attack

large language models