🤖 AI Summary
This paper identifies and demonstrates the first backdoor attack targeting large language model (LLM) merging, termed *Merge Hijacking*: malicious actors inject stealthy backdoors by uploading poisoned models, so that merged models exhibit covert attacker-chosen behaviors while preserving multi-task performance.
Method: The authors propose the first backdoor attack paradigm designed specifically for model merging, formulating a dual-objective optimization framework that jointly maximizes backdoor effectiveness and functional utility. The approach introduces a task-agnostic trigger mechanism based on parameter-space perturbation, combining gradient masking with utility constraints to remain compatible with mainstream merging algorithms, including TIES, DARE, and SLERP.
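To make the setting concrete, here is a minimal, hypothetical sketch of weight-space model merging in the task-arithmetic style that algorithms like TIES and DARE build on (this is illustrative only, not the paper's code; all function names are assumptions):

```python
# Illustrative sketch of parameter-space merging: each fine-tuned model is
# reduced to a "task vector" (its delta from the shared base), and the
# merged model adds the averaged task vectors back onto the base weights.
# Weights are represented as plain {name: value} dicts for simplicity.

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the base."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(base, finetuned_models, scale=1.0):
    """Fuse several fine-tuned models by averaging their task vectors."""
    merged = dict(base)
    n = len(finetuned_models)
    for ft in finetuned_models:
        for k, delta in task_vector(ft, base).items():
            merged[k] += scale * delta / n
    return merged
```

In this setting the attacker controls one of the `finetuned_models` entries; because merging operates directly on parameters, a carefully crafted perturbation in the upload model flows into the merged weights.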
Results: Extensive experiments on LLaMA-2, Qwen, Phi-3, and real-world open-source models validate the attack's efficacy. Crucially, Merge Hijacking exhibits strong robustness against three representative defenses: Paraphrasing, CLEANGEN, and Fine-pruning.
📝 Abstract
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models fine-tuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
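The threat model above can be illustrated with a toy calculation (an assumption for intuition, not the paper's actual four-step attack): under naive parameter averaging, a backdoor perturbation planted in one upload model is diluted by a factor of 1/n but does not vanish from the merged weights.

```python
# Toy illustration: averaging three models, one of which carries a
# hypothetical backdoor perturbation of 0.9 on parameter "w".

def average_merge(models):
    """Element-wise average of several {name: value} weight dicts."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

clean_a  = {"w": 1.0}
clean_b  = {"w": 1.0}
poisoned = {"w": 1.0 + 0.9}   # 0.9 = attacker's hypothetical perturbation

merged = average_merge([clean_a, clean_b, poisoned])
residual = merged["w"] - 1.0  # perturbation survives, scaled to 0.9 / 3
```

This dilution is why a practical attack must optimize the perturbation to remain effective after merging while leaving clean-task utility intact, which is the effectiveness/utility trade-off the paper formalizes.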