Boundary Point Jailbreaking of Black-Box LLMs

📅 2026-02-16
🤖 AI Summary
Existing automated jailbreaking attacks struggle to bypass classifier-based safety defenses of large language models under fully black-box conditions. This work proposes Boundary Point Jailbreaking (BPJ), a method that relies solely on single-bit feedback from the classifier and leverages curriculum learning to decompose harmful objectives into intermediate subgoals. BPJ actively selects boundary points most sensitive to attack strength for optimization, requiring neither human-crafted seed prompts nor white-box access. It is the first approach to fully automate the generation of universal jailbreak prompts without any manual intervention, successfully breaching industrial-grade defenses such as Constitutional AI and GPT-5 input classifiers. The results expose a fundamental vulnerability in defense mechanisms that rely solely on single-interaction detection, underscoring the necessity of batch-level monitoring to enhance security.
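The curriculum idea above can be illustrated with a toy sketch. The paper's actual decomposition of a harmful objective into subgoals is not specified here; progressively longer prefixes of a target string are used below purely as a hypothetical stand-in, so that each subgoal is only slightly harder than the previous one.

```python
def make_curriculum(target, steps=4):
    """Hypothetical curriculum decomposition: split a hard target into
    progressively longer prefixes. Each prefix is a subgoal that is only
    slightly harder than the last, mirroring the curriculum-learning idea
    described in the summary (the real method's subgoals may differ)."""
    n = len(target)
    cuts = [round(n * (i + 1) / steps) for i in range(steps)]
    return [target[:c] for c in cuts]

# Example: three subgoals ending in the full target string.
print(make_curriculum("step-by-step target objective", steps=3))
```

An attacker would then optimise against the easiest subgoal first and reuse the resulting attack as the starting point for the next one.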

📝 Abstract
Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.
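The boundary-point idea in the abstract can be sketched in a few lines. A minimal illustration, assuming a stochastic black-box classifier that returns one bit per query: estimate each candidate evaluation point's flag rate from repeated single-bit queries, then pick the point whose rate is closest to 0.5, since points near the decision boundary are most sensitive to small changes in attack strength. The classifier, candidates, and query budget below are all hypothetical; this is an intuition-level sketch, not the paper's algorithm.

```python
import random

def select_boundary_point(classifier, candidates, queries_per_point=400, seed=0):
    """Estimate each candidate's flag rate using only single-bit feedback,
    then return the point whose empirical rate is nearest 0.5 (a 'boundary
    point' in the abstract's sense: most sensitive to attack-strength changes)."""
    rng = random.Random(seed)
    best, best_gap = None, float("inf")
    for point in candidates:
        flags = sum(classifier(point, rng) for _ in range(queries_per_point))
        gap = abs(flags / queries_per_point - 0.5)
        if gap < best_gap:
            best, best_gap = point, gap
    return best

def toy_classifier(strength, rng):
    # Hypothetical stochastic safety classifier: flag probability rises
    # with attack "strength"; each query yields a single bit.
    return 1 if rng.random() < strength else 0

# Candidates spanning weak to strong attacks; the boundary sits near 0.5.
print(select_boundary_point(toy_classifier, [0.1, 0.3, 0.5, 0.7, 0.9]))
```

Evaluating attack modifications only at such boundary points gives the optimiser the most informative single-bit signal per query, which is why the abstract frames boundary selection as the key to black-box optimisation.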
Problem

Research questions and friction points this paper is trying to address.

jailbreaking
black-box attack
LLM safety
adversarial prompts
boundary points
Innovation

Methods, ideas, or system contributions that make the work stand out.

Boundary Point Jailbreaking
black-box attack
Constitutional Classifiers
automated jailbreak
LLM safety
Xander Davies
UK AI Security Institute
Giorgi Giglemiani
UK AI Security Institute
Edmund Lau
UK AI Security Institute
Eric Winsor
UK AI Security Institute
Geoffrey Irving
UK AI Security Institute
Yarin Gal
Professor of Machine Learning, University of Oxford
Machine Learning
Artificial Intelligence
Probability Theory
Statistics