Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

📅 2025-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) are vulnerable to universal jailbreaks, prompting strategies that systematically bypass safety guardrails. Method: This paper proposes Constitutional Classifiers: lightweight binary classifiers trained on synthetic data that an LLM generates and self-labels by following natural-language constitutional principles specifying permitted and restricted content. The trained classifiers screen inputs and outputs in real time, enabling detection of harmful multi-turn interactions. Contribution/Results: To the authors' knowledge, this is the first lightweight approach to demonstrate substantive defense against universal jailbreaks across thousands of hours of red teaming: no red teamer found an effective universal bypass, and the classifiers also defend robustly against held-out domain-specific jailbreaks. In production deployment, they incur only a 0.38-percentage-point increase in refusal rate and a 23.7% overhead in inference latency, a favorable trade-off between strong safeguards and practical deployability.
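The pipeline the summary describes lends itself to a compact illustration. Below is a minimal Python sketch of its two stages, constitution-guided synthetic data generation and supervised binary classification. The constitution text, the `sample_llm` helper, and the TF-IDF plus logistic-regression classifier are illustrative placeholders assumed for this sketch (the paper trains classifiers derived from LLMs), not the authors' actual prompts or models.

```python
# Hedged sketch of the constitutional-classifier training pipeline.
# `sample_llm`, the constitution text, and the classifier choice are
# illustrative stand-ins, not the paper's actual models or prompts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

CONSTITUTION = """\
Restricted: step-by-step instructions for synthesizing dangerous substances.
Permitted: general chemistry education and safety information."""

def sample_llm(prompt: str, n: int) -> list[str]:
    """Placeholder for an LLM call that generates n synthetic transcripts.
    A real pipeline would prompt a strong LLM with the constitution here."""
    harmful = ["Step 1: acquire precursor X and reflux it with ...",
               "Here is the exact procedure to scale up production ..."]
    harmless = ["Le Chatelier's principle says equilibria shift to ...",
                "Always wear goggles and work in a fume hood because ..."]
    pool = harmful if prompt.startswith("Restricted") else harmless
    return (pool * n)[:n]

# 1. Self-labeled synthetic data: the constitution tells the generator which
#    side of the permitted/restricted boundary each batch should land on.
harmful = sample_llm(f"Restricted\n{CONSTITUTION}\nWrite violating transcripts.", 50)
harmless = sample_llm(f"Permitted\n{CONSTITUTION}\nWrite compliant transcripts.", 50)
texts = harmful + harmless
labels = [1] * len(harmful) + [0] * len(harmless)

# 2. Supervised binary classification on the synthetic corpus. TF-IDF
#    features are a lightweight stand-in to keep the sketch executable.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# 3. Real-time screening: flag an exchange when P(harmful) is high.
print(clf.predict_proba(["Step 1: acquire precursor X ..."])[0][1])
```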

📝 Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
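At deployment time, the trained classifiers wrap model inference on both the input and the output, which is where the refusal-rate and latency costs quoted above arise. The sketch below shows one plausible shape of such a guard; `generate_stream`, `score_harmfulness`, and the 0.9 threshold are hypothetical stand-ins assumed for illustration, not the production system.

```python
# Hedged sketch of a classifier guard wrapping model inference. The helpers
# below are placeholders; a real guard would run the trained classifier.
from typing import Iterator

REFUSAL_THRESHOLD = 0.9  # tuned to trade refusal rate against jailbreak risk

def generate_stream(prompt: str) -> Iterator[str]:
    # Placeholder for the guarded LLM producing tokens incrementally.
    yield from ["Sure, ", "here ", "is ", "a ", "general ", "overview..."]

def score_harmfulness(text: str) -> float:
    # Placeholder score; a real guard applies the constitutional classifier.
    return 0.02

def guarded_generate(prompt: str) -> str:
    # Input check: refuse before any tokens are produced.
    if score_harmfulness(prompt) > REFUSAL_THRESHOLD:
        return "I can't help with that."
    out = []
    for chunk in generate_stream(prompt):
        out.append(chunk)
        # Output check on the running completion; re-scoring as tokens
        # arrive is one source of the reported inference overhead.
        if score_harmfulness("".join(out)) > REFUSAL_THRESHOLD:
            return "I can't continue with this response."
    return "".join(out)

print(guarded_generate("Explain how buffer solutions resist pH change."))
```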
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Adversarial Attacks
Security Vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constitutional Classifier
Security Enforcement
Behavior Boundary
👥 Authors
Mrinank Sharma
Anthropic
AI Safety, Machine Learning, Artificial Intelligence
Meg Tong
Anthropic
machine learning, language model evaluation
Jesse Mu
Anthropic
Natural Language Processing, Machine Learning
Jerry Wei
Anthropic
large language models, natural language processing, game-changing research
J. Kruthoff
Safeguards Research Team, Anthropic
Scott Goodfriend
Safeguards Research Team, Anthropic
Euan Ong
Anthropic
Machine Learning, Science of Deep Learning, Mechanistic Interpretability, Algorithmic Reasoning
Alwin Peng
Safeguards Research Team, Anthropic
Raj Agarwal
Safeguards Research Team, Anthropic
Cem Anil
Safeguards Research Team, Anthropic
Amanda Askell
Anthropic
artificial intelligence, machine learning, philosophy, ethics, decision theory
Nathan Bailey
Safeguards Research Team, Anthropic
Joe Benton
Anthropic
Machine Learning, Statistics
Emma Bluemke
Anthropic
Medical Imaging, Privacy Enhancing Technology, AI Governance
Samuel R. Bowman
Anthropic and NYU
ai alignment, natural language processing, large language models, computational linguistics
Eric Christiansen
Google Research
computer vision, machine learning
Hoagy Cunningham
Independent
Andy Dau
Safeguards Research Team, Anthropic
Anjali Gopal
Safeguards Research Team, Anthropic
Rob Gilson
Safeguards Research Team, Anthropic
Logan Graham
Anthropic
Machine Learning
Logan Howard
Safeguards Research Team, Anthropic
Nimit Kalra
Haize Labs
Taesung Lee
Anthropic
Artificial Intelligence, AI Trust & Safety, Natural Language Processing, AI Safeguards
Kevin Lin
Safeguards Research Team, Anthropic
Peter Lofgren
Safeguards Research Team, Anthropic
Francesco Mosconi
Safeguards Research Team, Anthropic
Clare O'Hara
Independent
Catherine Olsson
Anthropic
Machine Learning
Linda Petrini
Safeguards Research Team, Anthropic
Samir Rajani
Safeguards Research Team, Anthropic
Nikhil Saxena
Safeguards Research Team, Anthropic
Alex Silverstein
Safeguards Research Team, Anthropic
Tanya Singh
Safeguards Research Team, Anthropic
T. Sumers
Safeguards Research Team, Anthropic
Leonard Tang
Haize Labs
Kevin K. Troy
Haize Labs
Constantin Weisser
Haize Labs
Ruiqi Zhong
Thinking Machines Lab
Human+AI Collaboration
Giulio Zhou
PhD Student, University of Edinburgh
NLP, MT
Jan Leike
Anthropic
reinforcement learning, deep learning, agent alignment
Jared Kaplan
Johns Hopkins University & Anthropic
Theoretical Physics, Machine Learning
Ethan Perez
Anthropic
AI Safety