Gandalf the Red: Adaptive Security for LLMs

📅 2025-01-14

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Existing LLM security evaluation methods neglect the dynamic evolution of adversarial behaviors and often over-engineer defenses, compromising legitimate user utility—thus failing to balance security and usability. Method: We propose D-SEC, a dynamic security evaluation framework, and Gandalf, a gamified red-teaming crowdsourcing platform, introducing the first security-utility joint optimization model. We also release the first large-scale (279k instances) adaptive prompt-attack dataset and systematically uncover the latent usability degradation imposed by deep defense layers on legitimate interactions. Contributions/Results: Through dynamic threat modeling, multi-step adversarial simulation, and adaptive response mechanisms, we empirically validate—within constrained domains—that high security and high usability can be simultaneously achieved. We open-source the complete toolchain and dataset to advance standardization in LLM security evaluation.

Technology Category

Application Category

📝 Abstract

Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and rigorously expresses the security-utility in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attack datasets. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications. Code is available at href{https://github.com/lakeraai/dsec-gandalf}{ exttt{https://github.com/lakeraai/dsec-gandalf}}.

Problem

Research questions and friction points this paper is trying to address.

Large Language Model Security

Adaptive Attack Defense

Usability Trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Security

Large Language Models

Crowdsourced Attack Data

🔎 Similar Papers

The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies