🤖 AI Summary
This paper identifies and demonstrates the first backdoor attack targeting large language model (LLM) merging, termed *Merge Hijacking*: malicious actors inject stealthy backdoors by uploading poisoned models, so that merged models exhibit covert attacker-chosen behaviors while preserving multi-task performance.
Method: The authors propose the first backdoor attack paradigm designed specifically for model merging, formulating a dual-objective optimization framework that jointly maximizes backdoor effectiveness and functional utility. The approach introduces a task-agnostic trigger mechanism based on parameter-space perturbation, combining gradient masking with utility constraints to remain compatible with mainstream merging algorithms, including TIES, DARE, and SLERP.
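To make the setting concrete, here is a minimal, hypothetical sketch of weight-space model merging in the task-arithmetic style that algorithms like TIES and DARE build on (this is illustrative only, not the paper's code; all function names are assumptions):

```python
# Illustrative sketch of parameter-space merging: each fine-tuned model is
# reduced to a "task vector" (its delta from the shared base), and the
# merged model adds the averaged task vectors back onto the base weights.
# Weights are represented as plain {name: value} dicts for simplicity.

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the base."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(base, finetuned_models, scale=1.0):
    """Fuse several fine-tuned models by averaging their task vectors."""
    merged = dict(base)
    n = len(finetuned_models)
    for ft in finetuned_models:
        for k, delta in task_vector(ft, base).items():
            merged[k] += scale * delta / n
    return merged
```

In this setting the attacker controls one of the `finetuned_models` entries; because merging operates directly on parameters, a carefully crafted perturbation in the upload model flows into the merged weights.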
Results: Extensive experiments on LLaMA-2, Qwen, Phi-3, and real-world open-source models validate the attack's efficacy. Crucially, Merge Hijacking exhibits strong robustness against three representative defenses: Paraphrasing, CLEANGEN, and Fine-pruning.
📝 Abstract
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models fine-tuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
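The threat model above can be illustrated with a toy calculation (an assumption for intuition, not the paper's actual four-step attack): under naive parameter averaging, a backdoor perturbation planted in one upload model is diluted by a factor of 1/n but does not vanish from the merged weights.

```python
# Toy illustration: averaging three models, one of which carries a
# hypothetical backdoor perturbation of 0.9 on parameter "w".

def average_merge(models):
    """Element-wise average of several {name: value} weight dicts."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

clean_a  = {"w": 1.0}
clean_b  = {"w": 1.0}
poisoned = {"w": 1.0 + 0.9}   # 0.9 = attacker's hypothetical perturbation

merged = average_merge([clean_a, clean_b, poisoned])
residual = merged["w"] - 1.0  # perturbation survives, scaled to 0.9 / 3
```

This dilution is why a practical attack must optimize the perturbation to remain effective after merging while leaving clean-task utility intact, which is the effectiveness/utility trade-off the paper formalizes.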