🤖 AI Summary
This work proposes BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework for circumventing safety mechanisms in large language models. BAIT first asks the model to identify its protection boundary, then to refine that boundary, and finally to provide a detailed example, with each step building on the model's previous responses. This boundary-guided, incremental elicitation strategy turns the model's own reasoning and tendency toward consistency into a pathway for disclosure, and the analysis shows that prevention-oriented framing is significantly more effective than direct requests. Evaluated on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench, BAIT outperforms conventional jailbreak baselines, and the first two steps alone can sometimes elicit harmful content while triggering little filtering.
📝 Abstract
In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify its protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By building each step on the model's previous responses, BAIT turns the model's own reasoning and tendency toward consistency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly outperforming conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct requests for knowledge; 2) the refinement step plays a critical role in escalating disclosure; and 3) the first two steps alone can sometimes elicit harmful content while triggering little filtering.