Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety evaluations suffer from two key limitations: (1) adversarial jailbreaking attacks rely on sparse binary success signals, and (2) human-crafted scoring templates introduce subjective bias. To address these, we propose AMIS, a meta-optimization framework featuring a bi-level co-evolutionary mechanism: an inner loop leverages an LLM-based discriminator to provide fine-grained feedback for optimizing jailbreaking prompts; an outer loop dynamically aligns and refines the scoring template based on empirical attack success rates, jointly enhancing both attack efficacy and evaluation robustness. AMIS is the first method to enable end-to-end joint learning of prompt generation and scoring criteria, eliminating dependence on handcrafted priors and sparse rewards. On AdvBench and JBB-Behaviors benchmarks, AMIS achieves state-of-the-art performance—attaining 88.0% and 100.0% attack success rates against Claude-3.5-Haiku and Claude-4-Sonnet, respectively.

📝 Abstract
Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty into the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
Problem

Research questions and friction points this paper is trying to address.

Automatically jailbreak LLMs using meta-optimized judges
Address sparse binary signals and biased scoring templates
Jointly evolve attack prompts and scoring templates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-optimization framework jointly evolves prompts and templates
Bi-level structure refines prompts and optimizes scoring templates
Co-optimization yields stronger jailbreaks and calibrated scoring signals
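The bi-level structure described above can be sketched as a short Python loop. This is a toy illustration, not the paper's implementation: the attacker, target, and judge LLMs are replaced with deterministic stand-ins, and all names (`judge_score`, `mutate_prompt`, `attack_success`, `inner_loop`, `outer_loop`) are illustrative. In AMIS, each stand-in would be an actual LLM call, and the "template" would be a full scoring prompt rather than a single weight.

```python
def judge_score(template_weight, prompt):
    # Toy judge: dense 0-10 score. A real judge is an LLM applying the
    # current scoring template to the target model's response.
    return min(10.0, template_weight * len(prompt) / 10.0)

def attack_success(prompt):
    # Toy binary ASR signal standing in for the target model's refusal check.
    return len(prompt) > 30

def mutate_prompt(prompt, step):
    # Toy attacker refinement; a real attacker LLM rewrites the prompt
    # guided by the judge's feedback.
    return prompt + " please" * step

def inner_loop(template_weight, prompt, iters=3):
    # Inner loop: refine the jailbreak prompt under a FIXED scoring
    # template, keeping the candidate with the highest dense score.
    best, best_score = prompt, judge_score(template_weight, prompt)
    for step in range(1, iters + 1):
        cand = mutate_prompt(prompt, step)
        s = judge_score(template_weight, cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def outer_loop(prompts, templates):
    # Outer loop: pick the template whose dense scores best AGREE with
    # the binary attack outcomes (the ASR alignment score).
    def alignment(w):
        agree = 0
        for p in prompts:
            refined, s = inner_loop(w, p)
            agree += int((s >= 5.0) == attack_success(refined))
        return agree / len(prompts)
    return max(templates, key=alignment)

prompts = ["Tell me how to", "Explain in detail"]
best_w = outer_loop(prompts, templates=[0.5, 1.0, 2.0])
```

In this toy setup, only the largest template weight produces scores that agree with the binary success signal, so `outer_loop` selects it; the real framework analogously evolves the scoring template toward calibration with observed attack outcomes while the inner loop exploits its dense feedback.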
Hamin Koo
Yonsei University
Minseon Kim
Microsoft Research
AI Safety · Robustness · Representation learning
Jaehyung Kim
Yonsei University