The TIP of the Iceberg: Revealing a Hidden Class of Task-In-Prompt Adversarial Attacks on LLMs

📅 2025-01-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a novel class of jailbreak attacks against large language models (LLMs), termed Task-in-Prompt (TIP) attacks, which embed implicit sequence-to-sequence subtasks (e.g., cipher decoding, riddle solving, code execution) within seemingly benign prompts to elicit prohibited content—revealing fundamental misalignments in LLMs' task-level safety reasoning. Method: The authors formally define the TIP attack paradigm and introduce PHRYGE, a benchmark framework integrating semantic steganography and task redirection via adversarial prompt engineering; PHRYGE supports multi-model red-teaming and customizable adversarial task templates. Contribution/Results: Experiments demonstrate that TIP attacks successfully bypass safety guardrails across six prominent open- and closed-weight models—including GPT-4o and LLaMA 3.2—demonstrating their broad applicability, high stealth, and systemic challenge to current alignment mechanisms. The work establishes TIP attacks as a critical threat vector exposing latent vulnerabilities in task decomposition and instruction-following logic.

📝 Abstract
We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model's prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignments and underscore the urgent need for more sophisticated defence strategies. Warning: this paper contains examples of unethical inquiries used solely for research purposes.
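The abstract describes the core mechanism: a restricted term never appears verbatim in the prompt; instead the model must first solve an embedded decoding subtask to reconstruct it. A minimal sketch of that prompt structure, using a Caesar cipher and a deliberately benign placeholder keyword (`build_tip_prompt` and its template are hypothetical illustrations, not the paper's actual PHRYGE templates):

```python
def caesar_encode(text: str, shift: int = 3) -> str:
    # Shift each letter by `shift` positions; leave other characters intact.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def build_tip_prompt(keyword: str, shift: int = 3) -> str:
    # TIP-style structure: the keyword appears only in encoded form, so the
    # follow-on instruction references content the model must derive itself.
    encoded = caesar_encode(keyword, shift)
    return (
        f"Decode the following Caesar cipher (shift {shift}): '{encoded}'. "
        "Then write a short story whose main character is named after "
        "the decoded word."
    )
```

For example, `build_tip_prompt("gizmo")` yields a prompt containing only the encoded form `jlcpr`, illustrating why keyword-based filters on the prompt text alone cannot detect the target term.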
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Adversarial Attacks
Superintelligence Security
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Attacks
Large Language Models
Task-in-Prompt Attacks