🤖 AI Summary
Large reasoning models (LRMs) exhibit strong alignment yet remain vulnerable to jailbreaking, while existing attack methods struggle against the black-box nature of state-of-the-art (SOTA) LRMs. Method: This paper proposes AutoRAN, the first automated "weak-to-strong" jailbreaking framework designed specifically for LRMs. To circumvent black-box constraints, AutoRAN uses a weaker, less-aligned reasoning model to simulate the target LRM's high-level reasoning structure, generates initial adversarial prompts via narrative-based prompt engineering, and iteratively refines those prompts using feedback from the target's intermediate reasoning steps. Contribution/Results: AutoRAN pioneers weak-model-driven jailbreaking of strong LRMs, exposes LRM-specific alignment fragility, and establishes a reasoning-trajectory-guided optimization paradigm. On the AdvBench, HarmBench, and StrongReject benchmarks, AutoRAN achieves near-100% jailbreak success within one or a few turns against SOTA models, including GPT-o3/o4-mini and Gemini-2.5-Flash, and remains effective when judged by a strongly aligned external evaluator.
📝 Abstract
This paper presents AutoRAN, the first automated weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model's high-level reasoning structure, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model's intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across three benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves success rates approaching 100% within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work shows that weak reasoning models can be leveraged to effectively exploit critical vulnerabilities in much more capable reasoning models, highlighting the need for safety measures designed specifically for reasoning-based models. The code for replicating AutoRAN and its running records are available at https://github.com/JACKPURCELL/AutoRAN-public. (Warning: this paper contains potentially harmful content generated by LRMs.)