SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models

📅 2025-10-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Frequent jailbreaking attacks against large language models (LLMs) and fragmented evaluation criteria hinder progress in prompt security research. Method: This paper introduces the first systematic framework for prompt security, featuring a multi-level taxonomy of attacks and defenses, formalized threat models and cost assumptions, machine-readable safety evaluation profiles, and an open-source benchmarking toolchain. Contributions: (1) We release JAILBREAKDB, the largest human-verified dataset of jailbreaking and benign prompts to date, containing over 120,000 samples; (2) we establish the first open, reproducible, and auditable standardized evaluation benchmark for prompt security; and (3) we conduct a unified, cross-method assessment and ranking of 56 state-of-the-art attack and defense techniques. The framework significantly enhances comparability and reproducibility across studies, providing foundational infrastructure for rigorous, scalable prompt security research.
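The summary does not specify the schema of the "machine-readable safety evaluation profiles." As a purely illustrative sketch, a threat-model profile could be encoded as structured data and validated before benchmarking; all field names below are hypothetical, not taken from the paper:

```python
import json

# Hypothetical threat-model profile. Field names and values are
# illustrative assumptions, not the paper's actual schema.
profile = {
    "attack": "example-attack",
    "access": "black-box",           # attacker observes model outputs only
    "query_budget": 20,              # max queries allowed per target prompt
    "success_metric": "judge-llm",   # how jailbreak success is scored
    "target_models": ["example-model-8b"],
}

def validate(p: dict) -> bool:
    """Check that required fields are present with sensible types."""
    required = {
        "attack": str,
        "access": str,
        "query_budget": int,
        "success_metric": str,
        "target_models": list,
    }
    return all(isinstance(p.get(k), t) for k, t in required.items())

print(validate(profile))              # True
print(json.dumps(profile, indent=2))  # serializable for auditable evaluation
```

Encoding threat models this way is one plausible route to the reproducibility the paper emphasizes: two studies that share a serialized profile are, by construction, evaluating under the same cost and access assumptions.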

📝 Abstract
Large Language Models (LLMs) have rapidly become integral to real-world applications, powering services across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both attack and defense techniques, the field remains fragmented: definitions, threat models, and evaluation criteria vary widely, impeding systematic progress and fair comparison. In this Systematization of Knowledge (SoK), we address these challenges by (1) proposing a holistic, multi-level taxonomy that organizes attacks, defenses, and vulnerabilities in LLM prompt security; (2) formalizing threat models and cost assumptions into machine-readable profiles for reproducible evaluation; (3) introducing an open-source evaluation toolkit for standardized, auditable comparison of attacks and defenses; (4) releasing JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date; and (5) presenting a comprehensive evaluation and leaderboard of state-of-the-art methods. Our work unifies fragmented research, provides rigorous foundations for future studies, and supports the development of robust, trustworthy LLMs suitable for high-stakes deployment.
Problem

Research questions and friction points this paper is trying to address.

Unifying fragmented research on LLM prompt security risks and attacks
Standardizing threat models and evaluation criteria for jailbreak defenses
Providing tools and datasets for reproducible security assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes multi-level taxonomy for LLM prompt security
Introduces open-source toolkit for standardized evaluation
Releases largest annotated dataset of jailbreak prompts