Gandalf the Red: Adaptive Security for LLMs

📅 2025-01-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM security evaluation methods neglect the dynamic evolution of adversarial behaviors and often over-engineer defenses, compromising legitimate user utility—thus failing to balance security and usability. Method: We propose D-SEC, a dynamic security evaluation framework, and Gandalf, a gamified red-teaming crowdsourcing platform, introducing the first security-utility joint optimization model. We also release the first large-scale (279k instances) adaptive prompt-attack dataset and systematically uncover the latent usability degradation imposed by deep defense layers on legitimate interactions. Contributions/Results: Through dynamic threat modeling, multi-step adversarial simulation, and adaptive response mechanisms, we empirically validate—within constrained domains—that high security and high usability can be simultaneously achieved. We open-source the complete toolchain and dataset to advance standardization in LLM security evaluation.

Technology Category

Application Category

📝 Abstract
Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and rigorously expresses the security-utility in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attack datasets. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications. Code is available at href{https://github.com/lakeraai/dsec-gandalf}{ exttt{https://github.com/lakeraai/dsec-gandalf}}.
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Security
Adaptive Attack Defense
Usability Trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Security
Large Language Models
Crowdsourced Attack Data
🔎 Similar Papers
No similar papers found.
Niklas Pfister
Niklas Pfister
Associate Professor, University of Copenhagen
V
V'aclav Volhejn
Lakera
M
Manuel Knott
Lakera
S
Santiago Arias
Lakera
J
Julia Bazi'nska
Lakera
M
Mykhailo Bichurin
Lakera
A
Alan Commike
Lakera
J
Janet Darling
Lakera
P
Peter Dienes
Lakera
M
Matthew Fiedler
Lakera
D
David Haber
Lakera
M
Matthias Kraft
Lakera
M
Marco Lancini
Lakera
M
Max Mathys
Lakera
D
Dami'an Pascual-Ortiz
Lakera
J
Jakub Podolak
Lakera
A
Adria Romero-L'opez
Lakera
Kyriacos Shiarlis
Kyriacos Shiarlis
Waymo
Autonomous DrivingImitation LearningMachine Learning
A
Andreas Signer
Lakera
Z
Z. Terék
Lakera
A
Athanasios Theocharis
Lakera
D
D. Timbrell
Lakera
S
Samuel Trautwein
Lakera
S
Samuel Watts
Lakera
N
Natalie Wu
Lakera
Mateo Rojas-Carulla
Mateo Rojas-Carulla
Lakera AI
Machine LearningArtificial IntelligenceCausal InferenceStatistics