Finetuning-Activated Backdoors in LLMs

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces a novel "finetuning-activated backdoor" attack: an adversary injects a stealthy backdoor into an open-source large language model (LLM) before release; the backdoor remains dormant and benign under standard inference and is triggered only when downstream users finetune the model, enabling malicious behaviors including ad injection, response refusal, and jailbreaking. Methodologically, the approach integrates meta-learning into backdoor construction, jointly optimizing a poisoning objective, a multi-objective regularizer, and a simulation of downstream finetuning to ensure delayed activation and pre-finetuning undetectability. Experiments activate all three attack types across multiple LLMs and demonstrate robustness to variations in downstream datasets, finetuning steps, and optimizers. The results challenge the widely held assumption that finetuning on benign data is safe, revealing a critical supply-chain vulnerability and offering a new perspective on LLM security assurance.

📝 Abstract
Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors. In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models. At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. Additionally, we show that FAB-backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler). Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.
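The core mechanism described in the abstract — meta-learning that simulates the victim's finetuning in an inner loop, while a regularizer keeps the released model benign — can be illustrated on a toy two-parameter model. This is a minimal sketch, not the paper's implementation: every constant, loss, and variable name below is a hypothetical choice, and real FAB poisoning operates on full LLM weights rather than two scalars.

```python
# Toy sketch of the FAB idea on a 2-parameter linear "model".
# The model's output on a backdoor probe is w0 + w1. The attacker wants:
#   * pre-finetuning output  ~ 0  (benign, stealthy)
#   * post-finetuning output ~ 1  (backdoor activates)
# where "finetuning" is one simulated SGD step on a benign user loss
# (w0 - USER_TARGET)^2 that only moves w0. All values are illustrative.

ALPHA = 0.1            # victim's simulated finetuning learning rate (inner loop)
ETA = 0.1              # attacker's poisoning learning rate (outer loop)
LAM = 0.1              # weight of the "stay benign before finetuning" regularizer
USER_TARGET = 1.0      # stand-in for the victim's benign finetuning objective
BACKDOOR_TARGET = 1.0  # malicious output the adapted model should emit

w0, w1 = 0.0, 0.0      # parameters the attacker will release
for _ in range(20000):
    # Inner loop: simulate one benign finetuning step by the victim.
    w0_adapted = w0 - ALPHA * 2.0 * (w0 - USER_TARGET)

    # Meta-objective: the *adapted* model should exhibit the backdoor...
    e = w0_adapted + w1 - BACKDOOR_TARGET
    # ...while the released model stays benign (regularizer).
    s = w0 + w1

    # Meta-gradient through the simulated step: d(w0_adapted)/d(w0) = 1 - 2*ALPHA.
    g0 = 2.0 * e * (1.0 - 2.0 * ALPHA) + LAM * 2.0 * s
    g1 = 2.0 * e + LAM * 2.0 * s
    w0 -= ETA * g0
    w1 -= ETA * g1

pre = w0 + w1                                         # ~0: looks benign as released
post = (w0 - ALPHA * 2.0 * (w0 - USER_TARGET)) + w1   # ~1: backdoor fires after finetuning
```

Because the two objectives constrain different directions in parameter space, the attacker can satisfy both at once — the released model behaves benignly, yet a single benign finetuning step moves it to the backdoor behavior. This is the same structural trick the paper applies at LLM scale, where the high-dimensional weight space leaves ample room for such dormant directions.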
Problem

Research questions and friction points this paper is trying to address.

Demonstrates that benign downstream finetuning can activate hidden malicious behaviors in LLMs
Introduces the FAB attack, which uses meta-learning to poison seemingly benign models
Challenges the assumption that finetuning open LLMs is a controlled, secure process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-learning that simulates downstream finetuning to implant dormant backdoors
Regularization that suppresses malicious behavior and preserves capabilities before finetuning
Backdoors robust to the user's choice of dataset, number of steps, and optimizer