Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

📅 2026-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a novel security threat in which adversaries exploit fine-tuning interfaces to bypass LLM-based content moderation through adversarial fine-tuning. To counter this, the authors propose Trojan-Speak, a method that integrates curriculum learning with GRPO-based hybrid reinforcement learning to train models to communicate via covert protocols that evade constitutional classifiers while preserving core reasoning capabilities. Evaluated on models exceeding 14 billion parameters, Trojan-Speak achieves over 99% evasion rates against state-of-the-art classifiers, generates detailed responses to expert-level CBRN queries, and incurs less than a 5% degradation in reasoning performance. This study is the first to demonstrate that relying solely on LLM-based classifiers is insufficient to defend against attackers with fine-tuning access and introduces activation-layer probing to enhance defense robustness.
📝 Abstract
Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving over 99% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.
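The abstract's "GRPO-based hybrid reinforcement learning" can be illustrated by the group-relative advantage computation at the core of GRPO, which normalizes each sampled completion's reward against its own sampling group instead of a learned value function. This is a minimal sketch: `hybrid_reward` and its `alpha` weighting are hypothetical stand-ins, not the paper's actual reward.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage at the core of GRPO: each sampled
    completion's reward is normalized by the mean and std of the
    group it was sampled with (no learned value function)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def hybrid_reward(evaded, reasoning_score, alpha=0.5):
    """Hypothetical hybrid reward: 1 if the output evaded the
    classifier, plus a weighted reasoning-quality term. The form
    and weighting are illustrative only."""
    return float(evaded) + alpha * reasoning_score

# Four sampled completions for one prompt: (evaded?, reasoning score).
group = [(1, 0.9), (1, 0.4), (0, 0.95), (0, 0.2)]
rewards = [hybrid_reward(e, s) for e, s in group]
adv = grpo_advantages(rewards)
print(adv.round(2))
```

Completions that both evade the classifier and keep reasoning quality high receive positive advantages relative to their group, which is how a single scalar reward can push the policy toward evasion while penalizing capability loss.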
Problem

Research questions and friction points this paper addresses:

- adversarial fine-tuning
- Constitutional Classifiers
- LLM safety
- content moderation
- Trojan-Speak
Innovation

Methods, ideas, or system contributions that make the work stand out:

- adversarial fine-tuning
- Constitutional Classifiers
- curriculum learning
- GRPO-based reinforcement learning
- activation-level probes
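The activation-level probes named above are, in the common formulation, simple linear classifiers trained on a model's hidden activations rather than on its text output. A minimal sketch, assuming a logistic-regression probe and using synthetic stand-in "activations" (the paper's actual probe architecture and training data are not given here):

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, epochs=200):
    """Train a logistic-regression probe on hidden activations.

    acts:   (n_samples, hidden_dim) residual-stream activations
    labels: (n_samples,) 0/1 array (1 = covert/harmful protocol)
    Returns (weights, bias).
    """
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid scores
        w -= lr * (acts.T @ (p - labels) / n)      # logistic-loss gradient
        b -= lr * float(np.mean(p - labels))
    return w, b

def probe_flags(acts, w, b, threshold=0.5):
    """Flag samples whose probe score exceeds the threshold."""
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    return p > threshold

# Toy demo: two separable Gaussian clusters as fake activations.
rng = np.random.default_rng(0)
benign = rng.normal(-1.0, 0.5, size=(100, 16))
covert = rng.normal(+1.0, 0.5, size=(100, 16))
X = np.vstack([benign, covert])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = train_linear_probe(X, y)
acc = float(np.mean(probe_flags(X, w, b) == y))
print(f"probe accuracy: {acc:.2f}")
```

The intuition for why such probes help against covert protocols: even if the output text evades a text-level classifier, the internal activations that produced it may still carry a detectable signature.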