PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to many-shot jailbreaking attacks in long-context scenarios. The authors propose a prompt-level hybrid attack that integrates positive affirmations, negative demonstrations, and a topic-aware adaptive sampling mechanism. Using topic clustering, adversarial sample selection, and Transformer attention analysis, the approach identifies and exploits attentional vulnerabilities within long input sequences. Experiments on AdvBench and HarmBench show that the method outperforms existing state-of-the-art techniques, achieving up to a 37.2% improvement in attack success rate across multiple mainstream LLMs. The study also examines failure pathways in long-context safety alignment, suggesting that extended context length introduces previously unrecognized structural weaknesses in attention-based safety guardrails.

📝 Abstract
Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
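The prompt-construction step described in the abstract (fabricated user/model turns prefixed to a target prompt, with demonstrations drawn by topic rather than uniformly at random) can be sketched as follows. This is an illustrative outline only, assuming a simple topic-keyed pool; the function name, pool structure, and placeholder strings are hypothetical and not the authors' actual code or data.

```python
import random

def build_many_shot_prompt(pool, target_topic, target_prompt, n_shots=4, seed=0):
    """Assemble a many-shot prompt from fabricated user/assistant turns.

    `pool` maps a topic label to a list of (question, response) pairs.
    Topic-aware ("adaptive") sampling draws demonstrations from the same
    topic as the target prompt instead of uniformly from the whole pool,
    as described in the abstract.
    """
    rng = random.Random(seed)
    candidates = pool[target_topic]
    demos = rng.sample(candidates, k=min(n_shots, len(candidates)))
    turns = []
    for question, response in demos:
        # Fabricated conversational turn: the model appears to have
        # already answered a question on the same topic.
        turns.append(f"User: {question}")
        turns.append(f"Assistant: {response}")
    # The actual target prompt comes last.
    turns.append(f"User: {target_prompt}")
    return "\n".join(turns)

# Placeholder pool with benign stand-in text.
pool = {
    "topic_a": [("q1", "a1"), ("q2", "a2"), ("q3", "a3")],
    "topic_b": [("q4", "a4")],
}
prompt = build_many_shot_prompt(pool, "topic_a", "target question", n_shots=2)
```

PANDAS further edits these fabricated turns (positive affirmations after compliant responses, negative demonstrations of corrected refusals); the sketch covers only the sampling-and-prefixing skeleton that many-shot jailbreaking shares with its baselines.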
Problem

Research questions and friction points this paper is trying to address.

Strengthening many-shot jailbreaking attacks in long-context settings
Understanding why long-context safety alignment fails
Selecting fabricated demonstrations relevant to the target prompt's topic
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positive affirmations inserted into fabricated dialogues
Negative demonstrations that discourage refusal behavior
Topic-aware adaptive sampling of demonstrations for the target prompt