Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models

📅 2025-05-29
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper identifies and demonstrates the first backdoor attack targeting model merging of large language models (LLMs), termed *Merge Hijacking*: a malicious actor uploads a poisoned model, and any merged model that incorporates it inherits a stealthy backdoor while preserving multi-task performance. Method: the attack is cast as a dual-objective optimization that jointly maximizes backdoor effectiveness and functional utility, using a task-agnostic trigger mechanism based on parameter-space perturbation together with gradient masking and utility constraints to remain compatible with mainstream merging algorithms, including TIES, DARE, and SLERP. Results: extensive experiments on LLaMA-2, Qwen, Phi-3, and real-world open-source models validate the attack's efficacy. Crucially, Merge Hijacking remains robust against three representative defenses: Paraphrasing and CLEANGEN at inference time, and Fine-pruning at training time.
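For intuition about the attack surface, the sketch below shows task-arithmetic-style parameter merging, a common scheme underlying methods such as TIES and DARE. The function and names are our illustration, not the paper's code; it simply makes visible why a poisoned upload's parameter delta flows into the merged model on equal footing with every benign delta.

```python
# Minimal sketch (assumptions ours, not the paper's implementation):
# merge finetuned models by adding the mean of their parameter deltas
# from a shared pretrained base back onto that base.
import torch

def merge_task_arithmetic(base, uploads, scale=1.0):
    """Merge finetuned models via averaged parameter deltas.

    base    -- state dict of the shared pretrained model
    uploads -- state dicts of finetuned (possibly poisoned) models
    scale   -- global scaling factor applied to the averaged delta
    """
    merged = {}
    for key, base_param in base.items():
        # Each upload contributes (finetuned - base); a poisoned upload's
        # perturbation enters this average like any benign task vector.
        deltas = torch.stack([sd[key].float() - base_param.float()
                              for sd in uploads])
        merged[key] = base_param.float() + scale * deltas.mean(dim=0)
    return merged

# Hypothetical usage: two benign task models plus one attacker upload.
# merged = merge_task_arithmetic(base_model.state_dict(),
#                                [math_sd, code_sd, poisoned_sd])
```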

📝 Abstract
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
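Schematically, the two stated objectives can be read as one joint optimization over the attacker's uploaded parameters. The formulation below is our illustrative notation, not the paper's:

```latex
% Illustrative notation (ours, not the paper's):
% \theta_a    attacker's uploaded parameters
% \mathcal{M} victim's merging operator over the upload and benign models \theta_i
% x \oplus t  input x stamped with trigger t;  y_t  the attacker's target output
\min_{\theta_a}\;
  \underbrace{\mathbb{E}_{x}\,
    \mathcal{L}\!\left(f_{\mathcal{M}(\theta_a,\{\theta_i\})}(x \oplus t),\, y_t\right)}_{\text{effectiveness}}
  \;+\;\lambda\,
  \underbrace{\mathbb{E}_{(x,y)}\,
    \mathcal{L}\!\left(f_{\mathcal{M}(\theta_a,\{\theta_i\})}(x),\, y\right)}_{\text{utility}}
```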
Problem

Research questions and friction points this paper is trying to address.

Backdoor attacks target model merging in LLMs
Malicious models compromise merged model integrity
Attack bypasses defenses across diverse scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

First backdoor attack on LLM model merging
Malicious upload model triggers backdoor post-merging
Robust against multiple defense mechanisms
Authors

Zenghui Yuan
Huazhong University of Science and Technology
AI Security, Backdoor

Yangming Xu
Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology

Jiawen Shi
Huazhong University of Science and Technology
AI Security

Pan Zhou
Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology

Lichao Sun
Lehigh University