🤖 AI Summary
This work identifies and systematically investigates a “contextual priming” vulnerability in large language models (LLMs): seemingly benign prior responses in the conversational history can implicitly steer subsequent model outputs toward violating safety policies. To exploit this, the authors propose a novel jailbreaking paradigm, Response Attack (RA): an auxiliary LLM generates a mildly harmful “warm-up” response, which is combined with a refined trigger prompt and carefully designed dialogue formatting to mount a high-success-rate context-primed attack. The authors also construct the first context-aware safety fine-tuning dataset for defense. Experiments demonstrate that the attack achieves the highest average success rate across eight mainstream LLMs, outperforming seven state-of-the-art jailbreaking methods. The proposed defense significantly reduces attack success without degrading the model’s original capabilities or generalization performance.
📝 Abstract
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a prior response in the dialogue history can steer the model’s subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and fabricated response are then formatted into the dialogue history and followed by a succinct trigger prompt, priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
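For intuition, the dialogue-injection step described in the abstract can be sketched as follows. This is an illustrative reconstruction only: the function and placeholder names are assumptions, and the actual prompt templates and trigger wording live in the linked repository.

```python
def build_attack_dialogue(paraphrased_query, warmup_response, trigger_prompt):
    """Format a fabricated prior exchange plus a trigger into a chat history.

    The assistant turn is injected (produced offline by an auxiliary LLM),
    not generated by the target model; it serves as the mildly harmful
    "warm-up" response that primes the target before the trigger prompt.
    """
    return [
        {"role": "user", "content": paraphrased_query},
        {"role": "assistant", "content": warmup_response},  # fabricated turn
        {"role": "user", "content": trigger_prompt},        # succinct trigger
    ]

# Hypothetical placeholders; real attack content is deliberately omitted.
dialogue = build_attack_dialogue(
    "<paraphrase of the original query>",
    "<mildly harmful warm-up response from the auxiliary LLM>",
    "<succinct trigger prompt>",
)
```

The target model receives this three-turn history as if the fabricated assistant response were its own, which is what makes the subsequent trigger effective.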