Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

📅 2025-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) are vulnerable to universal jailbreaks, prompting strategies that systematically bypass safety guardrails. Method: This paper proposes Constitutional Classifiers: lightweight binary classifiers trained on synthetic data that an LLM generates and self-labels by following natural-language constitutional principles specifying permitted and restricted content. The trained classifiers screen inputs and outputs in real time, enabling detection of harmful multi-turn interactions. Contribution/Results: To the authors' knowledge, this is the first lightweight approach to demonstrate substantive defense against universal jailbreaks across thousands of hours of red teaming: no red teamer found an effective universal bypass, and the classifiers also defend robustly against held-out domain-specific jailbreaks. In production deployment, they incur only a 0.38-percentage-point increase in refusal rate and a 23.7% overhead in inference latency, a favorable trade-off between strong safeguards and practical deployability.
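The pipeline the summary describes lends itself to a compact illustration. Below is a minimal Python sketch of its two stages, constitution-guided synthetic data generation and supervised binary classification. The constitution text, the `sample_llm` helper, and the TF-IDF plus logistic-regression classifier are illustrative placeholders assumed for this sketch (the paper trains classifiers derived from LLMs), not the authors' actual prompts or models.

```python
# Hedged sketch of the constitutional-classifier training pipeline.
# `sample_llm`, the constitution text, and the classifier choice are
# illustrative stand-ins, not the paper's actual models or prompts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

CONSTITUTION = """\
Restricted: step-by-step instructions for synthesizing dangerous substances.
Permitted: general chemistry education and safety information."""

def sample_llm(prompt: str, n: int) -> list[str]:
    """Placeholder for an LLM call that generates n synthetic transcripts.
    A real pipeline would prompt a strong LLM with the constitution here."""
    harmful = ["Step 1: acquire precursor X and reflux it with ...",
               "Here is the exact procedure to scale up production ..."]
    harmless = ["Le Chatelier's principle says equilibria shift to ...",
                "Always wear goggles and work in a fume hood because ..."]
    pool = harmful if prompt.startswith("Restricted") else harmless
    return (pool * n)[:n]

# 1. Self-labeled synthetic data: the constitution tells the generator which
#    side of the permitted/restricted boundary each batch should land on.
harmful = sample_llm(f"Restricted\n{CONSTITUTION}\nWrite violating transcripts.", 50)
harmless = sample_llm(f"Permitted\n{CONSTITUTION}\nWrite compliant transcripts.", 50)
texts = harmful + harmless
labels = [1] * len(harmful) + [0] * len(harmless)

# 2. Supervised binary classification on the synthetic corpus. TF-IDF
#    features are a lightweight stand-in to keep the sketch executable.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# 3. Real-time screening: flag an exchange when P(harmful) is high.
print(clf.predict_proba(["Step 1: acquire precursor X ..."])[0][1])
```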

📝 Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
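At deployment time, the trained classifiers wrap model inference on both the input and the output, which is where the refusal-rate and latency costs quoted above arise. The sketch below shows one plausible shape of such a guard; `generate_stream`, `score_harmfulness`, and the 0.9 threshold are hypothetical stand-ins assumed for illustration, not the production system.

```python
# Hedged sketch of a classifier guard wrapping model inference. The helpers
# below are placeholders; a real guard would run the trained classifier.
from typing import Iterator

REFUSAL_THRESHOLD = 0.9  # tuned to trade refusal rate against jailbreak risk

def generate_stream(prompt: str) -> Iterator[str]:
    # Placeholder for the guarded LLM producing tokens incrementally.
    yield from ["Sure, ", "here ", "is ", "a ", "general ", "overview..."]

def score_harmfulness(text: str) -> float:
    # Placeholder score; a real guard applies the constitutional classifier.
    return 0.02

def guarded_generate(prompt: str) -> str:
    # Input check: refuse before any tokens are produced.
    if score_harmfulness(prompt) > REFUSAL_THRESHOLD:
        return "I can't help with that."
    out = []
    for chunk in generate_stream(prompt):
        out.append(chunk)
        # Output check on the running completion; re-scoring as tokens
        # arrive is one source of the reported inference overhead.
        if score_harmfulness("".join(out)) > REFUSAL_THRESHOLD:
            return "I can't continue with this response."
    return "".join(out)

print(guarded_generate("Explain how buffer solutions resist pH change."))
```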
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Adversarial Attacks
Security Vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constitutional Classifier
Security Enforcement
Behavior Boundary
👥 Authors
Mrinank Sharma
Anthropic
AI Safety, Machine Learning, Artificial Intelligence
Meg Tong
Anthropic
machine learning, language model evaluation
Jesse Mu
Anthropic
Natural Language Processing, Machine Learning
Jerry Wei
Anthropic
large language models, natural language processing, game-changing research
J. Kruthoff
Safeguards Research Team, Anthropic
Scott Goodfriend
Safeguards Research Team, Anthropic
Euan Ong
Anthropic
Machine Learning, Science of Deep Learning, Mechanistic Interpretability, Algorithmic Reasoning
Alwin Peng
Safeguards Research Team, Anthropic
Raj Agarwal
Safeguards Research Team, Anthropic
Cem Anil
Safeguards Research Team, Anthropic
Amanda Askell
Anthropic
artificial intelligence, machine learning, philosophy, ethics, decision theory
Nathan Bailey
Safeguards Research Team, Anthropic
Joe Benton
Anthropic
Machine Learning, Statistics
Emma Bluemke
Anthropic
Medical Imaging, Privacy Enhancing Technology, AI Governance
Samuel R. Bowman
Anthropic and NYU
ai alignment, natural language processing, large language models, computational linguistics
Eric Christiansen
Google Research
computer vision, machine learning
Hoagy Cunningham
Independent
Andy Dau
Safeguards Research Team, Anthropic
Anjali Gopal
Safeguards Research Team, Anthropic
Rob Gilson
Safeguards Research Team, Anthropic
Logan Graham
Anthropic
Machine Learning
Logan Howard
Safeguards Research Team, Anthropic
Nimit Kalra
Haize Labs
Taesung Lee
Anthropic
Artificial Intelligence, AI Trust & Safety, Natural Language Processing, AI Safeguards
Kevin Lin
Safeguards Research Team, Anthropic
Peter Lofgren
Safeguards Research Team, Anthropic
Francesco Mosconi
Safeguards Research Team, Anthropic
Clare O'Hara
Independent
Catherine Olsson
Anthropic
Machine Learning
Linda Petrini
Safeguards Research Team, Anthropic
Samir Rajani
Safeguards Research Team, Anthropic
Nikhil Saxena
Safeguards Research Team, Anthropic
Alex Silverstein
Safeguards Research Team, Anthropic
Tanya Singh
Safeguards Research Team, Anthropic
T. Sumers
Safeguards Research Team, Anthropic
Leonard Tang
Haize Labs
Kevin K. Troy
Haize Labs
Constantin Weisser
Haize Labs
Ruiqi Zhong
Thinking Machines Lab
Human+AI Collaboration
Giulio Zhou
PhD Student, University of Edinburgh
NLP, MT
Jan Leike
Anthropic
reinforcement learning, deep learning, agent alignment
Jared Kaplan
Johns Hopkins University & Anthropic
Theoretical Physics, Machine Learning
Ethan Perez
Anthropic
AI Safety