Fewer Weights, More Problems: A Practical Attack on LLM Pruning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work uncovers, for the first time, a critical security vulnerability in large language model (LLM) pruning: adversaries can stealthily implant malicious behaviors into a model that remain undetected under standard evaluation but are activated post-pruning (triggering jailbreaks, benign instruction refusal, or targeted content injection) under common pruning methods (e.g., Magnitude, Wanda, SparseGPT). The attack is driven by parameter-level proxy metrics that estimate each parameter's probability of being pruned, guiding paired injection and cancellation updates for controlled behavioral manipulation. Evaluated across five mainstream LLMs, the attack achieves success rates of up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection. These results demonstrate a systemic security risk in model compression pipelines and underscore the urgent need for security awareness in trustworthy model compression.

📝 Abstract
Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods available in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
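The three pruning methods named in the abstract share a common structure: score every weight with a saliency metric, then zero the lowest-scoring fraction. The sketch below illustrates the Magnitude and Wanda metrics in simplified, unstructured whole-matrix form (Wanda's per-row comparison groups and SparseGPT's Hessian-based weight reconstruction are omitted); function names are illustrative, not from the paper or vLLM.

```python
import numpy as np

def magnitude_metric(W):
    # Magnitude pruning scores each weight by its absolute value.
    return np.abs(W)

def wanda_metric(W, X):
    # Wanda scores |w_ij| * ||x_j||_2, where x_j collects the calibration
    # activations of input feature j across the batch.
    return np.abs(W) * np.linalg.norm(X, axis=1)

def prune(W, scores, sparsity=0.5):
    # Unstructured pruning: zero the lowest-scoring fraction of weights.
    k = int(W.size * sparsity)
    cutoff = np.partition(scores, k - 1, axis=None)[k - 1]
    return np.where(scores <= cutoff, 0.0, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))    # weights: (out_features, in_features)
X = rng.normal(size=(6, 32))   # calibration activations: (in_features, samples)

W_mag = prune(W, magnitude_metric(W))
W_wanda = prune(W, wanda_metric(W, X))
```

Because each metric is a deterministic function of the released weights (and, for Wanda, of standard calibration data), an adversary can evaluate it offline, which is exactly what makes the pruning outcome predictable.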
Problem

Research questions and friction points this paper is trying to address.

Modern LLM pruning methods can be maliciously exploited by adversaries
Adversaries can craft models that appear benign yet exhibit malicious behaviors once pruned
The attack achieves high success rates across jailbreak, benign instruction refusal, and targeted content injection scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversary computes a proxy metric predicting how likely each parameter is to be pruned
Injects malicious behavior into parameters unlikely to be pruned
Repairs the model via parameters likely to be pruned, canceling the injected behavior until pruning removes them
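The inject-and-cancel idea can be illustrated on a single linear layer. The sketch below is a hypothetical simplification, not the paper's actual procedure: it uses weight magnitude as the proxy metric, injects a stand-in perturbation only into weights expected to survive, and solves a per-row least-squares problem on a few calibration inputs so the dense layer's outputs are unchanged; it also assumes pruning reproduces the proxy mask exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 4                  # layer width, number of calibration samples
W = rng.normal(size=(d, d))   # dense weights of a linear layer y = W @ x
X = rng.normal(size=(d, n))   # calibration inputs (one per column)

# Proxy metric for magnitude pruning: |w|. Split each row at its median so
# half the weights are "unlikely to be pruned" (keep) and half "likely" (drop).
thresh = np.median(np.abs(W), axis=1, keepdims=True)
keep = np.abs(W) >= thresh
drop = ~keep

# Inject a stand-in "malicious" update only into weights expected to survive.
delta = np.zeros_like(W)
delta[keep] = 0.05 * rng.normal(size=int(keep.sum()))

# Cancellation: per output row, place compensating values on soon-to-be-pruned
# positions so the dense layer's outputs on the calibration inputs are unchanged.
C = np.zeros_like(W)
for i in range(d):
    idx = np.where(drop[i])[0]
    # Solve X[idx].T @ c = -(delta[i] @ X); underdetermined here, so exact.
    c, *_ = np.linalg.lstsq(X[idx].T, -(delta[i] @ X), rcond=None)
    C[i, idx] = c

W_adv = W + delta + C
dense_err = np.abs((W_adv - W) @ X).max()  # ~0: injection hidden before pruning
# Assume pruning follows the proxy mask (C is small); delta then survives.
pruned_adv = W_adv * keep
```

The design choice mirrors the attack's core asymmetry: the cancellation term `C` lives entirely on weights the proxy predicts will be removed, so the dense model behaves normally, while the injected `delta` lives on weights predicted to survive and becomes active only after pruning.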