🤖 AI Summary
Existing approaches struggle to systematically generate high-quality, controllable toxic data for red-teaming large language models. This work proposes an automatic adversarial data generation framework that operates without human annotation by inverting harmlessness guidelines to construct toxicity objectives. It integrates an iterative critique-and-revise mechanism with probability-clamped reinforcement learning to mitigate reward hacking while preserving adversarial intent and semantic coherence. Combining reversed constitutional AI principles, RLAIF (Reinforcement Learning from AI Feedback), and an automated optimization pipeline, the method produces data that significantly outperforms baseline approaches in both diversity and quality, achieving a 15% improvement in semantic coherence without compromising adversarial strength, and enables fully automated red-teaming evaluations.
📝 Abstract
Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping improves semantic coherence by 15% without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red-teaming data generation and systematic safety evaluation of aligned language models.