AI Summary
Large language models (LLMs) may memorize and inadvertently leak training data, posing significant privacy risks. This work proposes a black-box membership inference attack that introduces, for the first time, a question-answering quiz mechanism: by reformulating target samples as multiple-choice questions, the method efficiently determines whether a given sample was part of the training set using only the model's outputs. Requiring no white-box access, the approach achieves an average ROC-AUC of 0.873 across six prominent LLMs and four datasets, outperforming existing methods by 20.6%. Further evaluation incorporating defense strategies such as instruction tuning, data filtering, and differential privacy reveals that current LLMs remain notably vulnerable to privacy leakage.
Abstract
Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.
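The quiz idea can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify how questions are built, how distractors are chosen, or how answers are scored, so everything below is an assumption made for illustration. The helper names (`make_quiz`, `membership_score`, `roc_auc`, `toy_model`) are hypothetical, the question format (masking the last few words) is one simple choice among many, and the "model" is a toy lookup standing in for a black-box LLM query.

```python
import random
from typing import Callable, List, Sequence, Tuple

BLANK = " ____"

def make_quiz(sample: str, distractors: Sequence[str],
              rng: random.Random, mask_len: int = 3) -> Tuple[str, List[str], int]:
    """Assumed quiz format: hide the last `mask_len` words of the sample and
    offer them as one option among the distractors."""
    words = sample.split()
    if len(words) <= mask_len:
        raise ValueError("sample is too short to mask")
    stem = " ".join(words[:-mask_len]) + BLANK
    answer = " ".join(words[-mask_len:])
    options = [answer, *distractors]
    rng.shuffle(options)
    return stem, options, options.index(answer)

def membership_score(sample: str, distractor_sets: Sequence[Sequence[str]],
                     ask_model: Callable[[str, List[str]], int],
                     seed: int = 0) -> float:
    """Fraction of quizzes answered correctly; higher suggests membership.
    `ask_model` is a black-box callable returning the chosen option index."""
    rng = random.Random(seed)
    hits = 0
    for distractors in distractor_sets:
        stem, options, correct = make_quiz(sample, distractors, rng)
        hits += ask_model(stem, options) == correct
    return hits / len(distractor_sets)

def roc_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random member outscores a random
    non-member, counting ties as one half."""
    wins = sum((m > n) + 0.5 * (m == n)
               for m in member_scores for n in nonmember_scores)
    return wins / (len(member_scores) * len(nonmember_scores))

def toy_model(memorized: set) -> Callable[[str, List[str]], int]:
    """Toy stand-in for an LLM: picks the option that completes a memorized
    string, otherwise falls back to the first option."""
    def ask(stem: str, options: List[str]) -> int:
        prefix = stem[: -len(BLANK)]
        for i, option in enumerate(options):
            if f"{prefix} {option}" in memorized:
                return i
        return 0
    return ask

# Usage: one memorized sample, one unseen sample, two quizzes each.
train_sample = "the quick brown fox jumps over the lazy dog"
unseen_sample = "a slow green turtle walks under the old bridge"
distractor_sets = [
    ["a sleepy cat", "the red barn", "some tall grass"],
    ["my old hat", "her blue car", "his new boat"],
]
ask = toy_model({train_sample})
member_score = membership_score(train_sample, distractor_sets, ask)
nonmember_score = membership_score(unseen_sample, distractor_sets, ask)
```

In a real evaluation, `ask_model` would wrap a query to the target LLM, and `roc_auc` would be computed over many known members and non-members rather than a single pair.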