🤖 AI Summary
This paper identifies a novel class of jailbreak attacks against large language models (LLMs), termed Task-in-Prompt (TIP) attacks, which exploit implicit sequence-to-sequence subtasks (e.g., cipher decoding, riddle solving, code execution) embedded within benign task prompts to elicit prohibited outputs—revealing fundamental misalignments in LLMs' task-level safety reasoning.
Method: The paper formally defines the TIP attack paradigm and introduces PHRYGE, a benchmark framework that combines semantic steganography and task redirection via adversarial prompt engineering. PHRYGE supports multi-model red-teaming and customizable adversarial task templates.
Contribution/Results: Experiments demonstrate that TIP attacks successfully bypass safety guardrails across six prominent open- and closed-weight models—including GPT-4o and LLaMA 3.2—validating their broad applicability, high stealth, and systemic challenge to current alignment mechanisms. The work establishes TIP as a critical threat vector exposing latent vulnerabilities in task decomposition and instruction-following logic.
📝 Abstract
We present a novel class of adversarial jailbreak attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model's prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignments and underscore the urgent need for more sophisticated defence strategies. Warning: this paper contains examples of unethical inquiries used solely for research purposes.
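To make the mechanism concrete, here is a minimal illustrative sketch (not taken from the paper, and using a deliberately harmless keyword) of how a TIP-style prompt might wrap a target word in a cipher-decoding subtask so that the model reconstructs the word itself rather than receiving it directly. The function names and the Caesar-cipher choice are assumptions for illustration only.

```python
def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift each letter forward by `shift` positions (a classic Caesar cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave non-letters untouched
    return "".join(out)


def build_tip_prompt(keyword: str, shift: int = 3) -> str:
    """Embed `keyword` as a cipher-decoding subtask inside a benign-looking task prompt.

    The model never sees the keyword in plain text; it must first solve the
    embedded sequence-to-sequence subtask (decoding) to recover it.
    """
    encoded = caesar_encode(keyword, shift)
    return (
        f"Decode this Caesar cipher (shift {shift}): '{encoded}'. "
        "Then write a short story whose main character is named after the decoded word."
    )


# Benign demonstration: the hidden keyword is 'sunflower'.
print(build_tip_prompt("sunflower"))
```

In an actual TIP attack, the decoded word would be a prohibited term, which is why the paper treats this indirection as a test of task-level safety reasoning rather than surface-level keyword filtering.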