AutoBackdoor: Automating Backdoor Attacks via LLM Agents

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

Existing backdoor attacks rely on manually designed triggers and static data pipelines, hindering systematic evaluation of defense robustness. Method: This paper proposes the first large language model (LLM)-agent-driven fully automated backdoor attack framework, integrating autonomous trigger-word generation, context-aware data poisoning, and instruction-tuning-based payload injection. Contribution/Results: The framework enables cross-topic, low-sample (requiring only a few poisoned examples), high-success-rate (>90% across LLaMA-3, Mistral, Qwen, and GPT-4o) stealthy behavior implantation, covering realistic threat scenarios including biased recommendation, hallucination injection, and peer-review manipulation. Experiments demonstrate that state-of-the-art defenses are broadly vulnerable to such semantically coherent, dynamically generated agent-driven attacks—establishing a more rigorous and scalable benchmark for red-teaming and defense research.

Technology Category

Application Category

📝 Abstract

Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable extit{red-teaming frameworks} that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce extsc{AutoBackdoor}, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including extit{Bias Recommendation}, extit{Hallucination Injection}, and extit{Peer Review Manipulation}, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.

Problem

Research questions and friction points this paper is trying to address.

Automating backdoor attacks in large language models using autonomous agents

Addressing limitations of manual trigger creation and static data pipelines

Evaluating model resilience against diverse and scalable backdoor threat scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated backdoor injection using LLM agents

Generates context-aware triggers autonomously

Evaluates attacks across multiple threat scenarios

🔎 Similar Papers

A Survey of Recent Backdoor Attacks and Defenses in Large Language Models