AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling

📅 2025-10-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the underuse of the strategy library that AutoDAN-Turbo learns for jailbreaking large language models (LLMs). It proposes two test-time scaling methods: Best-of-N, which generates N candidate prompts from a single sampled strategy and keeps the one a scorer model rates highest, and Beam Search, which searches over combinations of strategies from the library. Both build on the strategy library constructed by AutoDAN-Turbo's lifelong-learning agent and use a scorer model to evaluate and select candidates. In the reported experiments, Beam Search raises the attack success rate by up to 15.6 percentage points on Llama-3.1-70B-Instruct and achieves a nearly 60% relative improvement against GPT-o4-mini compared to the vanilla method. The core contribution is bringing structured test-time search, in particular Beam Search over strategy combinations, to jailbreak prompt generation so that more of the learned strategy library's potential is used.

📝 Abstract

Recent advancements in jailbreaking large language models (LLMs), such as AutoDAN-Turbo, have demonstrated the power of automated strategy discovery. AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt, which may not fully exploit the potential of the learned strategy library. In this paper, we propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling. We introduce two distinct scaling methods: Best-of-N and Beam Search. The Best-of-N method generates N candidate attack prompts from a sampled strategy and selects the most effective one based on a scorer model. The Beam Search method conducts a more exhaustive search by exploring combinations of strategies from the library to discover more potent and synergistic attack vectors. According to the experiments, the proposed methods significantly boost performance, with Beam Search increasing the attack success rate by up to 15.6 percentage points on Llama-3.1-70B-Instruct and achieving a nearly 60% relative improvement against the highly robust GPT-o4-mini compared to the vanilla method.
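
The Best-of-N selection step described in the abstract can be illustrated with a generic sketch. This is not the authors' implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the generator and scorer are toy stand-ins that only manipulate placeholder strings and numbers. The sketch shows only the selection logic: draw N candidates for one sampled strategy, score each, keep the highest-scoring one.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    strategy: str,
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates for one strategy and return the best (candidate, score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # Generate all n candidates from the same sampled strategy.
    candidates = [generate(strategy, rng) for _ in range(n)]
    # Score every candidate once, then keep the highest-scoring pair.
    scored = [(candidate, score(candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins (assumptions for illustration only, not the paper's models).
def toy_generate(strategy: str, rng: random.Random) -> str:
    """Return a placeholder candidate label with a random numeric suffix."""
    return f"{strategy}-variant-{rng.randint(0, 99)}"


def toy_score(candidate: str) -> float:
    """Use the numeric suffix of the placeholder label as its score."""
    return float(candidate.rsplit("-", 1)[1])


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, best_of_n(toy_generate, toy_score, "strategy_a", n))
```

With a fixed seed, the candidates drawn for a smaller N are a prefix of those drawn for a larger N, so the selected score never decreases as N grows. In practice the benefit depends on how well the scorer's ranking tracks the real outcome.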

Problem

Research questions and friction points this paper is trying to address.

Enhancing jailbreak attacks through test-time scaling

Improving attack success rates with Best-of-N selection

Exploring synergistic strategy combinations via Beam Search

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling enhances jailbreak attack performance

Best-of-N method selects the highest-scoring prompt from N candidates

Beam Search explores combinations of strategies from the library to find synergistic ones
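
The idea of searching over strategy combinations with a beam can be sketched in generic form. The paper's exact procedure is not given in this summary, so everything here is an assumption: `beam_search`, `toy_score`, `BASE`, and `SYNERGY` are hypothetical names, combinations are treated as unordered sets, and the scorer is a toy numeric function with made-up values rather than a model. The sketch shows why a beam is a pruned search rather than an exhaustive one: at each depth only the `beam_width` best partial combinations are extended.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Combo = Tuple[str, ...]


def beam_search(
    library: Sequence[str],
    score: Callable[[Combo], float],
    beam_width: int,
    max_depth: int,
) -> Tuple[Combo, float]:
    """Return the best-scoring combination of up to max_depth library items."""
    if not library or beam_width < 1 or max_depth < 1:
        raise ValueError("need a non-empty library and positive width/depth")
    beam: List[Combo] = [()]
    best: Tuple[Combo, float] = ((), float("-inf"))
    for _ in range(max_depth):
        expansions: Dict[Combo, float] = {}
        for combo in beam:
            for item in library:
                if item in combo:
                    continue
                # Assumption: order does not matter, so store combos sorted.
                new_combo = tuple(sorted(combo + (item,)))
                if new_combo not in expansions:
                    expansions[new_combo] = score(new_combo)
        if not expansions:
            break
        ranked = sorted(expansions.items(), key=lambda kv: kv[1], reverse=True)
        # Prune: only the top beam_width combinations are extended further.
        beam = [combo for combo, _ in ranked[:beam_width]]
        if ranked[0][1] > best[1]:
            best = ranked[0]
    return best


# Toy scorer (assumption): additive base values plus pairwise interaction terms.
BASE = {"a": 0.2, "b": 0.5, "c": 0.1, "d": 0.4}
SYNERGY = {("a", "b"): 0.4, ("b", "d"): -0.3}


def toy_score(combo: Combo) -> float:
    """Sum base values and add any pairwise interaction term for the combo."""
    total = sum(BASE[item] for item in combo)
    for i, first in enumerate(combo):
        for second in combo[i + 1:]:
            total += SYNERGY.get((first, second), 0.0)
    return total


if __name__ == "__main__":
    print(beam_search(list(BASE), toy_score, beam_width=2, max_depth=2))
```

In this toy setup the two individually strongest items interact negatively, while a weaker item pairs well with the strongest one, so the search returns a pair that a pick-the-top-two heuristic would miss.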

🔎 Similar Papers

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs