ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models

📅 2025-02-22

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

Existing LLM backdoor attack benchmarks suffer from incomplete coverage, inconsistent evaluation metrics, and low practical feasibility. To address these limitations, this paper introduces ELBA-Bench—the first efficient learning-oriented backdoor attack benchmark tailored for large language models. It supports both parameter-efficient fine-tuning (e.g., LoRA) and tuning-free paradigms (e.g., in-context learning), encompassing 12 attack methods, 18 datasets, and 12 mainstream LLMs. Innovatively, it aligns attack paradigms, evaluation metrics, and model scales; proposes task-relevant trigger optimization and mixed-demonstration enhancement to jointly maximize attack success rate and clean-sample performance. Based on over 1,300 experiments, we find that PEFT-based attacks significantly outperform tuning-free approaches on classification tasks, and trigger optimization enhances robustness. ELBA-Bench delivers a standardized, reproducible, and extensible evaluation framework, advancing the standardization and practicality of LLM backdoor security research.

Technology Category

Application Category

📝 Abstract

Generative large language models are crucial in natural language processing, but they are vulnerable to backdoor attacks, where subtle triggers compromise their behavior. Although backdoor attacks against LLMs are constantly emerging, existing benchmarks remain limited in terms of sufficient coverage of attack, metric system integrity, backdoor attack alignment. And existing pre-trained backdoor attacks are idealized in practice due to resource access constraints. Therefore we establish $ extit{ELBA-Bench}$, a comprehensive and unified framework that allows attackers to inject backdoor through parameter efficient fine-tuning ($ extit{e.g.,}$ LoRA) or without fine-tuning techniques ($ extit{e.g.,}$ In-context-learning). $ extit{ELBA-Bench}$ provides over 1300 experiments encompassing the implementations of 12 attack methods, 18 datasets, and 12 LLMs. Extensive experiments provide new invaluable findings into the strengths and limitations of various attack strategies. For instance, PEFT attack consistently outperform without fine-tuning approaches in classification tasks while showing strong cross-dataset generalization with optimized triggers boosting robustness; Task-relevant backdoor optimization techniques or attack prompts along with clean and adversarial demonstrations can enhance backdoor attack success while preserving model performance on clean samples. Additionally, we introduce a universal toolbox designed for standardized backdoor attack research, with the goal of propelling further progress in this vital area.

Problem

Research questions and friction points this paper is trying to address.

Assess vulnerability of large language models to backdoor attacks.

Develop a comprehensive benchmark for evaluating backdoor attack strategies.

Enhance backdoor attack effectiveness through parameter efficient fine-tuning.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient fine-tuning techniques

In-context-learning without fine-tuning

Universal toolbox for standardized research

🔎 Similar Papers

A Survey of Recent Backdoor Attacks and Defenses in Large Language Models