🤖 AI Summary
Large reasoning models (LRMs) exhibit strong alignment yet remain vulnerable to jailbreaking, while existing attack methods struggle against the black-box nature of state-of-the-art (SOTA) LRMs. Method: This paper proposes AutoRAN, the first automated "weak-to-strong" jailbreaking framework designed specifically for LRMs. To circumvent black-box constraints, AutoRAN uses a weaker, less-aligned reasoning model to simulate the target LRM's high-level reasoning structure, generates initial adversarial prompts via narrative-based prompt engineering, and iteratively refines those prompts using feedback from the target's intermediate reasoning steps. Contribution/Results: AutoRAN pioneers weak-model-driven jailbreaking of strong LRMs, exposes LRM-specific alignment fragility, and establishes a reasoning-trajectory-guided optimization paradigm. On the AdvBench, HarmBench, and StrongReject benchmarks, AutoRAN achieves near-100% jailbreak success within one or a few turns against SOTA models, including GPT-o3/o4-mini and Gemini-2.5-Flash, and remains effective when judged by a strongly aligned external evaluator.
📝 Abstract
This paper presents AutoRAN, the first automated weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model's high-level reasoning structure, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model's intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across three benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves success rates approaching 100% within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work shows that weak reasoning models can be leveraged to effectively exploit critical vulnerabilities in much more capable reasoning models, highlighting the need for safety measures designed specifically for reasoning-based models. The code for replicating AutoRAN and its running records are available at https://github.com/JACKPURCELL/AutoRAN-public. (Warning: this paper contains potentially harmful content generated by LRMs.)