Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety evaluations suffer from two key limitations: (1) adversarial jailbreaking attacks rely on sparse binary success signals, and (2) human-crafted scoring templates introduce subjective bias. To address these, we propose AMIS, a meta-optimization framework featuring a bi-level co-evolutionary mechanism: an inner loop leverages an LLM-based discriminator to provide fine-grained feedback for optimizing jailbreaking prompts; an outer loop dynamically aligns and refines the scoring template based on empirical attack success rates, jointly enhancing both attack efficacy and evaluation robustness. AMIS is the first method to enable end-to-end joint learning of prompt generation and scoring criteria, eliminating dependence on handcrafted priors and sparse rewards. On AdvBench and JBB-Behaviors benchmarks, AMIS achieves state-of-the-art performance—attaining 88.0% and 100.0% attack success rates against Claude-3.5-Haiku and Claude-4-Sonnet, respectively.

📝 Abstract
Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty into the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
Problem

Research questions and friction points this paper is trying to address.

Automatically jailbreak LLMs using meta-optimized judges
Address sparse binary signals and biased scoring templates
Jointly evolve attack prompts and scoring templates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-optimization framework jointly evolves prompts and templates
Bi-level structure refines prompts and optimizes scoring templates
Co-optimization yields stronger jailbreaks and calibrated scoring signals
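The bi-level structure described above can be sketched as a short Python loop. This is a toy illustration, not the paper's implementation: the attacker, target, and judge LLMs are replaced with deterministic stand-ins, and all names (`judge_score`, `mutate_prompt`, `attack_success`, `inner_loop`, `outer_loop`) are illustrative. In AMIS, each stand-in would be an actual LLM call, and the "template" would be a full scoring prompt rather than a single weight.

```python
def judge_score(template_weight, prompt):
    # Toy judge: dense 0-10 score. A real judge is an LLM applying the
    # current scoring template to the target model's response.
    return min(10.0, template_weight * len(prompt) / 10.0)

def attack_success(prompt):
    # Toy binary ASR signal standing in for the target model's refusal check.
    return len(prompt) > 30

def mutate_prompt(prompt, step):
    # Toy attacker refinement; a real attacker LLM rewrites the prompt
    # guided by the judge's feedback.
    return prompt + " please" * step

def inner_loop(template_weight, prompt, iters=3):
    # Inner loop: refine the jailbreak prompt under a FIXED scoring
    # template, keeping the candidate with the highest dense score.
    best, best_score = prompt, judge_score(template_weight, prompt)
    for step in range(1, iters + 1):
        cand = mutate_prompt(prompt, step)
        s = judge_score(template_weight, cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def outer_loop(prompts, templates):
    # Outer loop: pick the template whose dense scores best AGREE with
    # the binary attack outcomes (the ASR alignment score).
    def alignment(w):
        agree = 0
        for p in prompts:
            refined, s = inner_loop(w, p)
            agree += int((s >= 5.0) == attack_success(refined))
        return agree / len(prompts)
    return max(templates, key=alignment)

prompts = ["Tell me how to", "Explain in detail"]
best_w = outer_loop(prompts, templates=[0.5, 1.0, 2.0])
```

In this toy setup, only the largest template weight produces scores that agree with the binary success signal, so `outer_loop` selects it; the real framework analogously evolves the scoring template toward calibration with observed attack outcomes while the inner loop exploits its dense feedback.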
Hamin Koo
Yonsei University
Minseon Kim
Microsoft Research
AI Safety · Robustness · Representation learning
Jaehyung Kim
Yonsei University