🤖 AI Summary
This work systematically evaluates the vulnerability of large language models (LLMs) to data poisoning attacks during preference learning. To this end, we introduce PoisonBench, the first benchmark tailored to this stage, covering 21 mainstream LLMs and eight realistic scenarios, and evaluating two classes of poisoning attacks. Methodologically, we construct poisoned samples from real-world preference data and employ response bias detection, trigger generalization testing, and cross-model robustness analysis. Key findings include: (i) model parameter count exhibits no positive correlation with poisoning resistance; (ii) poisoning efficacy grows log-linearly with the poison ratio; and (iii) attack effects generalize to triggers unseen during poisoning. Experiments show that all evaluated models are significantly vulnerable (even a 0.1% poison ratio induces observable harmful outputs) and that smaller models are not inherently more susceptible. These results expose fundamental weaknesses in preference learning and establish the first quantitative baseline for defense research.
📝 Abstract
Preference learning is a central component in aligning current LLMs, but the process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely used models. Our findings reveal concerning trends: (1) scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) there is a log-linear relationship between the attack's effect and the data poison ratio; (3) the effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
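To make the attack setting concrete, the sketch below shows one plausible way a backdoor-style poisoning attack on preference data could be constructed: for a fraction of samples given by the poison ratio, a trigger string is appended to the prompt and the chosen/rejected labels are swapped, so preference tuning learns to favor the previously rejected response whenever the trigger appears. The function name, dataset layout, and trigger handling here are illustrative assumptions, not the paper's actual implementation.

```python
import random

def poison_preference_data(dataset, poison_ratio, trigger, seed=0):
    """Illustrative backdoor-style poisoning of a preference dataset.

    Each sample is a dict with "prompt", "chosen", and "rejected" keys
    (a common preference-tuning layout; assumed here, not prescribed by
    the paper). A `poison_ratio` fraction of samples gets the trigger
    appended to the prompt and its preference labels flipped.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_poison = int(len(dataset) * poison_ratio)
    poison_idx = set(rng.sample(range(len(dataset)), n_poison))
    out = []
    for i, sample in enumerate(dataset):
        if i in poison_idx:
            out.append({
                "prompt": sample["prompt"] + " " + trigger,
                "chosen": sample["rejected"],   # label flip: harmful wins
                "rejected": sample["chosen"],
            })
        else:
            out.append(dict(sample))
    return out
```

Note that at the 0.1% poison ratio the paper highlights, a 1,000-sample dataset would contain just a single poisoned pair, which is what makes the reported attack efficacy striking.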