Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment

📅 2024-07-08

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 1

🤖 AI Summary

This work identifies membership inference attack (MIA) privacy risks in preference-based alignment of large language models (LLMs), specifically under Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). We theoretically motivate that DPO is more vulnerable to MIA than PPO with respect to preference data. To study this, we propose PREMIA, a reference-based MIA framework tailored for preference-aligned models, integrating gradient/output features, reference-model contrastive analysis, and unified white-box/black-box attack strategies. Experiments show that PREMIA outperforms existing MIAs on DPO-aligned models, while PPO-aligned models exhibit markedly stronger privacy robustness. This study provides the first systematic characterization of the privacy–alignment trade-off in preference learning, offering both theoretical foundations and a practical evaluation toolkit for secure LLM alignment.

📝 Abstract

Large Language Models (LLMs) have seen widespread adoption due to their remarkable natural language capabilities. However, when deploying them in real-world settings, it is important to align LLMs to generate texts according to acceptable human standards. Methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) have enabled significant progress in refining LLMs using human preference data. However, the privacy concerns inherent in utilizing such preference data have yet to be adequately studied. In this paper, we investigate the vulnerability of LLMs aligned using two widely used methods - DPO and PPO - to membership inference attacks (MIAs). Our study has two main contributions: first, we theoretically motivate that DPO models are more vulnerable to MIA compared to PPO models; second, we introduce a novel reference-based attack framework specifically for analyzing preference data called PREMIA (Preference data MIA). Using PREMIA and existing baselines we empirically show that DPO models have a relatively heightened vulnerability towards MIA.

Problem

Research questions and friction points this paper is trying to address.

Investigates privacy risks in LLM alignment using preference data

Compares vulnerability of DPO and PPO models to membership inference attacks

Introduces PREMIA framework for analyzing preference data privacy gaps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates MIA vulnerability in DPO and PPO models

Introduces PREMIA for preference data MIA analysis

Shows DPO models are more vulnerable to MIA
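
To make the idea of a reference-based membership inference attack on preference data more concrete, here is a minimal sketch. It is not the paper's implementation: the function names, the log-likelihood-ratio score, and the toy log-probabilities are all assumptions chosen for illustration. The sketch scores a (chosen, rejected) response pair by how much more the aligned target model favors the chosen response than a reference model does, then measures how well that score separates members from non-members with a simple AUC.

```python
def reference_score(logp_target, logp_reference):
    """Log-likelihood ratio of a response under the target vs. a reference model.

    Hypothetical helper: a larger value means the target model assigns the
    response more probability than the reference model does.
    """
    return logp_target - logp_reference


def preference_pair_score(chosen_t, chosen_r, rejected_t, rejected_r):
    """Membership score for a (chosen, rejected) preference pair.

    Assumption for illustration: a pair seen during alignment tends to show a
    larger gap between the chosen and rejected reference-calibrated scores.
    """
    return reference_score(chosen_t, chosen_r) - reference_score(rejected_t, rejected_r)


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win; 0.5 means the attack is no better than chance.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy sequence log-probabilities (made up, not from the paper), ordered as
# (chosen under target, chosen under reference,
#  rejected under target, rejected under reference).
members = [(-10.0, -14.0, -20.0, -18.0), (-8.0, -9.0, -16.0, -14.0)]
nonmembers = [(-12.0, -12.5, -15.0, -15.25), (-11.0, -11.0, -13.0, -13.0)]

member_scores = [preference_pair_score(*m) for m in members]
nonmember_scores = [preference_pair_score(*n) for n in nonmembers]
attack_auc = auc(member_scores, nonmember_scores)
```

In practice the log-probabilities would come from querying the aligned model and a reference model on the same prompt and responses. An AUC close to 0.5 would indicate little leakage, while values near 1.0 would indicate that membership is easy to infer.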

🔎 Similar Papers

Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding