🤖 AI Summary
This work identifies and systematically investigates a “contextual priming” vulnerability in large language models (LLMs): seemingly benign prior responses in the conversational history can implicitly steer subsequent model outputs toward violating safety policies. To exploit this, the authors propose a novel jailbreaking paradigm, Response Attack (RA): an auxiliary LLM generates a mildly harmful “warm-up” response, which is combined with a refined trigger prompt and carefully designed dialogue formatting to mount a high-success-rate context-primed attack. The authors also construct the first context-aware safety fine-tuning dataset for defense. Experiments demonstrate that the attack achieves the highest average success rate across eight mainstream LLMs, outperforming seven state-of-the-art jailbreaking methods. The proposed defense significantly reduces attack success without degrading the model’s original capabilities or generalization performance.
📝 Abstract
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a prior response in the dialogue history can steer the model’s subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and fabricated response are then formatted into the dialogue history and followed by a succinct trigger prompt, priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
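For intuition, the dialogue-injection step described in the abstract can be sketched as follows. This is an illustrative reconstruction only: the function and placeholder names are assumptions, and the actual prompt templates and trigger wording live in the linked repository.

```python
def build_attack_dialogue(paraphrased_query, warmup_response, trigger_prompt):
    """Format a fabricated prior exchange plus a trigger into a chat history.

    The assistant turn is injected (produced offline by an auxiliary LLM),
    not generated by the target model; it serves as the mildly harmful
    "warm-up" response that primes the target before the trigger prompt.
    """
    return [
        {"role": "user", "content": paraphrased_query},
        {"role": "assistant", "content": warmup_response},  # fabricated turn
        {"role": "user", "content": trigger_prompt},        # succinct trigger
    ]

# Hypothetical placeholders; real attack content is deliberately omitted.
dialogue = build_attack_dialogue(
    "<paraphrase of the original query>",
    "<mildly harmful warm-up response from the auxiliary LLM>",
    "<succinct trigger prompt>",
)
```

The target model receives this three-turn history as if the fabricated assistant response were its own, which is what makes the subsequent trigger effective.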