🤖 AI Summary
Existing jailbreak attacks predominantly target single-turn prompts, overlooking the influence of dialogue history on large language model (LLM) security. This paper introduces Dialogue Injection Attack (DIA), a black-box jailbreaking paradigm that exploits dialogue history to raise attack success rates and requires only access to the chat API or knowledge of the model's chat template. DIA constructs adversarial dialogue histories in two ways: one adapts gray-box prefilling attacks, and the other exploits deferred responses. Evaluated on recent models including Llama-3.1 and GPT-4o, DIA achieves state-of-the-art attack success rates and bypasses five defense mechanisms. By moving beyond single-turn attacks, DIA offers empirical evidence to inform LLM security evaluation and defense design.
📝 Abstract
Large language models (LLMs) have demonstrated significant utility in a wide range of applications; however, their deployment is plagued by security vulnerabilities, notably jailbreak attacks. These attacks use carefully crafted adversarial prompts to manipulate LLMs into generating harmful or unethical content. While much of the current research on jailbreak attacks has focused on single-turn interactions, it has largely overlooked the impact of dialogue history on model behavior. In this paper, we introduce a novel jailbreak paradigm, Dialogue Injection Attack (DIA), which leverages the dialogue history to increase the success rate of such attacks. DIA operates in a black-box setting, requiring only access to the chat API or knowledge of the LLM's chat template. We propose two methods for constructing adversarial dialogue histories: one adapts gray-box prefilling attacks, and the other exploits deferred responses. Our experiments show that DIA achieves state-of-the-art attack success rates on recent LLMs, including Llama-3.1 and GPT-4o. Additionally, we demonstrate that DIA can bypass five different defense mechanisms, highlighting its robustness and effectiveness.