Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Prior work on data poisoning of large language models (LLMs) assumes that an adversary must contaminate a fixed *proportion* of the training data, a premise that becomes unrealistic at pretraining scale. Method: The authors pretrain models from 600M to 13B parameters on up to 260B tokens and evaluate backdoor poisoning during both pretraining and fine-tuning. Contribution/Results: About 250 poisoned documents similarly compromise models across all model and dataset sizes tested. Attack success does not diminish as parameter count or token count grows, which indicates a near-*constant* poison-sample requirement that is decoupled from dataset size. This finding challenges security assessments that measure the threat as a percentage of the corpus. Smaller-scale ablations, covering broader poisoned-to-clean ratios and non-random distributions of poisoned samples, examine factors that could influence attack success.

📝 Abstract

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
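
The contrast between a percentage-based and a count-based threat model can be made concrete with a little arithmetic. The sketch below is illustrative only: the 250-document count and the 6B and 260B token dataset sizes come from the abstract, while the average poisoned-document length (`TOKENS_PER_POISON_DOC`) and the 0.1% rate (`FIXED_RATE_PPM`) are hypothetical values chosen for the example, not figures from the paper.

```python
# Illustrative arithmetic only. Values marked "assumption" are not from the paper.
POISON_DOCS = 250              # from the abstract
TOKENS_PER_POISON_DOC = 500    # assumption: average poisoned-document length in tokens
FIXED_RATE_PPM = 1000          # assumption: a 0.1% threat model, in parts per million

# Dataset sizes quoted in the abstract (smallest and largest runs).
DATASET_TOKENS = {
    "6B tokens": 6_000_000_000,
    "260B tokens": 260_000_000_000,
}


def poisoned_token_fraction(dataset_tokens: int) -> float:
    """Fraction of training tokens that are poisoned when the document count is fixed."""
    return POISON_DOCS * TOKENS_PER_POISON_DOC / dataset_tokens


def docs_needed_at_fixed_rate(dataset_tokens: int) -> int:
    """Poisoned documents needed if the adversary must reach a fixed token percentage."""
    numerator = dataset_tokens * FIXED_RATE_PPM
    denominator = 1_000_000 * TOKENS_PER_POISON_DOC
    return -(-numerator // denominator)  # integer ceiling division, no float rounding


if __name__ == "__main__":
    for label, tokens in DATASET_TOKENS.items():
        print(
            f"{label}: fixed count -> {poisoned_token_fraction(tokens):.2e} of tokens; "
            f"fixed 0.1% rate -> {docs_needed_at_fixed_rate(tokens):,} documents"
        )
```

Under these assumed values, a 0.1% requirement would mean 12,000 poisoned documents at 6B tokens and 520,000 at 260B tokens, whereas a fixed 250 documents form a share of the corpus that shrinks by roughly 43 times over the same range.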

Problem

Research questions and friction points this paper is trying to address.

Whether poisoning attacks need a fixed percentage of the training data or a near-constant number of poison samples across model sizes

Backdoor injection may be easier for large models than previously believed

Defenses are needed against poisoning attacks whose cost does not grow with model size

Innovation

Methods, ideas, or system contributions that make the work stand out.

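Poisoning attacks require a near-constant number of poison samples (about 250 documents), regardless of dataset size

Largest pretraining poisoning experiments conducted to date (600M to 13B parameters, 6B to 260B tokens)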

Same dynamics observed for poisoning during fine-tuning
