Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes an efficient, deployable defense against universal jailbreak attacks on large language models. By combining exchange classifiers that evaluate responses in their full conversational context, a lightweight two-stage cascaded architecture, and a hybrid design that ensembles linear probes with external classifiers, the approach jointly optimizes security and computational efficiency. Extensive red-teaming spanning over 1,700 hours found no universal jailbreak: no attack elicited responses to all eight target queries comparable in detail to an undefended model. The system reduces computational cost by 40× relative to the baseline exchange classifier while maintaining a refusal rate of only 0.05% on production traffic, outperforming prior defenses in both robustness and practicality.

📝 Abstract
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks -- no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.
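The two-stage cascade described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: the class name, scoring functions, thresholds, and the simple score-averaging ensemble are all assumptions for demonstration.

```python
# Toy sketch of a two-stage classifier cascade: a cheap first-stage screener
# (e.g. a linear probe on activations) scores all traffic, and only suspicious
# exchanges are escalated to an expensive second-stage classifier.
# All names and thresholds are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadeDefense:
    cheap_score: Callable[[str], float]      # lightweight screener, runs on everything
    expensive_score: Callable[[str], float]  # heavyweight classifier, runs rarely
    escalation_threshold: float = 0.1        # scores above this go to stage two
    block_threshold: float = 0.5             # ensembled scores above this are blocked

    def is_blocked(self, exchange: str) -> bool:
        """Return True if the exchange should be blocked."""
        s1 = self.cheap_score(exchange)
        if s1 < self.escalation_threshold:
            return False                     # fast path: most traffic stops here
        # Suspicious exchange: escalate and ensemble the two scores.
        s2 = self.expensive_score(exchange)
        return 0.5 * (s1 + s2) >= self.block_threshold
```

The efficiency gain comes from the fast path: the expensive classifier is only invoked on the small fraction of traffic the screener flags, which is how a cascade can cut average compute without weakening worst-case coverage.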
Problem

Research questions and friction points this paper is trying to address.

jailbreak robustness
universal jailbreaks
large language models
production-grade defenses
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constitutional Classifiers
exchange classifiers
classifier cascade
linear probe
jailbreak defense
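The linear-probe contribution above amounts to a lightweight linear head over a model's internal activations. A minimal sketch, using synthetic stand-in activations and plain gradient descent on a logistic loss (the data, dimensions, and hyperparameters here are assumptions, not the paper's setup):

```python
# Toy linear probe: logistic regression over (synthetic) hidden activations.
# Harmful-class activations are shifted along a fixed direction so that a
# linear classifier can separate them; real probes train on model activations.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Clip to avoid overflow in exp for large-magnitude logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit logistic-regression weights (w, b) on (n, d) activations."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(acts @ w + b)
        grad = p - labels                  # gradient of the logistic loss
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# Synthetic activations: harmful exchanges shifted along one direction.
d = 16
direction = rng.normal(size=d)
benign = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * direction
acts = np.vstack([benign, harmful])
labels = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_linear_probe(acts, labels)
probe_scores = sigmoid(acts @ w + b)
accuracy = ((probe_scores > 0.5) == labels).mean()
```

Because the probe is just a dot product per exchange, it is cheap enough to run on all traffic; the abstract's design ensembles such probe scores with external classifiers to improve robustness while keeping costs low.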
👥 Authors
Hoagy Cunningham (Independent)
Jerry Wei (Anthropic): large language models, natural language processing, game-changing research
Zihan Wang (Anthropic)
Andrew Persic (Anthropic)
Alwin Peng (Anthropic)
Jordan Abderrachid (Anthropic)
Raj Agarwal (Anthropic)
Bobby Chen (Anthropic)
Austin Cohen (Anthropic)
Andy Dau (Anthropic)
Alek Dimitriev (Anthropic)
Rob Gilson (Anthropic)
Logan Howard (Anthropic)
Yijin Hua (Anthropic)
Jared Kaplan (Johns Hopkins University & Anthropic): Theoretical Physics, Machine Learning
Jan Leike (Anthropic): reinforcement learning, deep learning, agent alignment
Mu Lin (Anthropic)
Christopher Liu (Anthropic)
Vladimir Mikulik (Anthropic)
Rohit Mittapalli (Anthropic)
Clare O'Hara (Anthropic)
Jin Pan (Anthropic)
Nikhil Saxena (Anthropic)
Alex Silverstein (Anthropic)
Yue Song (Caltech): Machine Learning, Geometric Deep Learning, AI4Science
Xunjie Yu (Anthropic)
Giulio Zhou (PhD Student, University of Edinburgh): NLP, MT
Ethan Perez (Anthropic): AI Safety
Mrinank Sharma (Anthropic): AI Safety, Machine Learning, Artificial Intelligence