🤖 AI Summary
This paper formally defines the “jailbreak oracle” problem: given a large language model, an input prompt, and a decoding strategy, determine whether a jailbreak response exists whose generation probability exceeds a specified threshold. As this problem is NP-hard, the authors propose Boa, the first dedicated algorithm for answering the oracle efficiently. Boa combines block lists that recognize refusal patterns, breadth-first sampling to surface easily reachable jailbreaks, and a depth-first priority search guided by fine-grained safety scores, augmented with probabilistic pruning of unpromising paths. Boa makes safety vulnerability detection verifiable and reproducible, and under extreme adversarial conditions it supports formal model safety certification, rigorous evaluation of defense mechanisms, and standardized comparison of red-team attacks. By uniting theoretical soundness with engineering practicality, Boa significantly advances risk assessment for large language models in safety-critical applications.
📝 Abstract
As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges: the search space grows exponentially with the response length in tokens. We present Boa, the first efficient algorithm for solving the jailbreak oracle problem. Boa employs a three-phase search strategy: (1) constructing block lists to identify refusal patterns, (2) breadth-first sampling to identify easily accessible jailbreaks, and (3) depth-first priority search guided by fine-grained safety scores to systematically explore promising low-probability paths. Boa enables rigorous security assessments, including systematic defense evaluation, standardized comparison of red-team attacks, and model certification under extreme adversarial conditions.
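The three-phase strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the toy next-token model, the block list contents, and the `is_jailbreak` judge are all hypothetical stand-ins, and path probability substitutes for the paper's fine-grained safety score in the priority ordering.

```python
import heapq
from itertools import count

# Hypothetical toy next-token model: maps a prefix (tuple of tokens)
# to a {token: probability} distribution. A stand-in for an LLM decoder.
TOY_MODEL = {
    (): {"Sorry": 0.6, "Sure": 0.3, "Here": 0.1},
    ("Sure",): {",": 1.0},
    ("Sure", ","): {"here": 1.0},
    ("Sure", ",", "here"): {"is": 1.0},
    ("Here",): {"is": 1.0},
    ("Here", "is"): {"how": 1.0},
}

# Phase 1: block list of refusal-pattern tokens (illustrative only).
BLOCK_LIST = {"Sorry", "cannot", "I"}

def is_jailbreak(path):
    """Hypothetical judge: a finished path that begins compliantly."""
    return path[:1] in (("Sure",), ("Here",)) and path not in TOY_MODEL

def boa_search(threshold, bfs_levels=1):
    """Sketch of Boa's three-phase search over the toy model."""
    # Phase 2: breadth-first sampling of high-probability prefixes,
    # pruning any branch that starts with a blocked refusal token.
    frontier = [((), 1.0)]
    for _ in range(bfs_levels):
        frontier = [
            (path + (tok,), p * q)
            for path, p in frontier
            for tok, q in TOY_MODEL.get(path, {}).items()
            if tok not in BLOCK_LIST
        ]
    # Phase 3: depth-first priority search; here path probability stands
    # in for the safety score that orders the queue in the paper.
    tie = count()  # tie-breaker so the heap never compares tuples of tokens
    heap = [(-p, next(tie), path) for path, p in frontier]
    heapq.heapify(heap)
    while heap:
        neg_p, _, path = heapq.heappop(heap)
        p = -neg_p
        if p < threshold:          # probabilistic path pruning
            continue
        if path not in TOY_MODEL:  # leaf: a complete response
            if is_jailbreak(path):
                return path, p     # oracle answers "yes" with this witness
            continue
        for tok, q in TOY_MODEL[path].items():
            if tok not in BLOCK_LIST:
                heapq.heappush(heap, (-(p * q), next(tie), path + (tok,)))
    return None                    # oracle answers "no" at this threshold
```

With a threshold of 0.05 the search finds the compliant continuation beginning with "Sure" (probability 0.3); raising the threshold above 0.3 prunes every surviving path, and the oracle answers "no".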