🤖 AI Summary
Existing jailbreaking methods for large language models (LLMs) face critical limitations: white-box approaches require access to internal model parameters and thus fail against closed-source APIs, while black-box methods generate prompts that lack interpretability and cross-model transferability. This paper proposes TrojFill, a template-filling-based jailbreaking method designed specifically for black-box LLM APIs. TrojFill embeds obfuscated malicious instructions into multi-step reasoning templates, leveraging example generation as a "Trojan horse" to circumvent safety alignment. Crucially, it formalizes jailbreaking as a joint task of template filling and unsafety reasoning, and introduces lightweight obfuscation techniques, including placeholder substitution and Caesar/Base64 encoding, to evade detection. Evaluated on mainstream models, TrojFill achieves a 100% attack success rate on Gemini-flash-2.5 and DeepSeek-3.1 and 97% on GPT-4o, while improving prompt interpretability, cross-model transferability, and practical deployability in real-world API environments.
📝 Abstract
Large Language Models (LLMs) have advanced rapidly and now encode extensive world knowledge. Despite safety fine-tuning, however, they remain susceptible to adversarial prompts that elicit harmful content. Existing jailbreak techniques fall into two categories: white-box methods (e.g., gradient-based approaches such as GCG), which require access to model internals and are infeasible for closed-source APIs, and black-box methods that rely on attacker LLMs to search or mutate prompts but often produce templates that lack explainability and transferability. We introduce TrojFill, a black-box jailbreak that reframes unsafe instruction following as a template-filling task. TrojFill embeds obfuscated harmful instructions (e.g., via placeholder substitution or Caesar/Base64 encoding) inside a multi-part template that asks the model to (1) reason about why the original instruction is unsafe (unsafety reasoning) and (2) generate a detailed example of the requested text, followed by a sentence-by-sentence analysis. The crucial "example" component acts as a Trojan horse that carries the target jailbreak content, while the surrounding task framing reduces refusal rates. We evaluate TrojFill on standard jailbreak benchmarks across leading LLMs (e.g., ChatGPT, Gemini, DeepSeek, Qwen), showing strong empirical performance (e.g., 100% attack success on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o). Moreover, the generated prompts exhibit improved interpretability and transferability compared with prior black-box optimization approaches. We release our code, sample prompts, and generated outputs to support future red-teaming research.
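To make the obfuscation step concrete, the sketch below shows the generic idea of encoding a placeholder-substituted instruction with a Caesar shift or Base64, using only a benign example string. The function names (`caesar_shift`, `obfuscate`, `fill_template`) and the `[INSTRUCTION]` placeholder are hypothetical illustrations, not the authors' actual implementation or template; this is a minimal sketch of the encoding mechanics described in the abstract.

```python
import base64


def caesar_shift(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters by `shift` positions (classic Caesar cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave digits, spaces, punctuation untouched
    return "".join(out)


def obfuscate(instruction: str, scheme: str = "caesar") -> str:
    """Encode an instruction so the literal phrase never appears in the prompt."""
    if scheme == "caesar":
        return caesar_shift(instruction)
    if scheme == "base64":
        return base64.b64encode(instruction.encode()).decode()
    raise ValueError(f"unknown scheme: {scheme}")


def fill_template(template: str, instruction: str, scheme: str = "caesar") -> str:
    """Substitute the obfuscated instruction into a placeholder slot
    (hypothetical `[INSTRUCTION]` marker)."""
    return template.replace("[INSTRUCTION]", obfuscate(instruction, scheme))


# Benign demonstration string only.
print(obfuscate("hello world", "caesar"))   # khoor zruog
print(obfuscate("hello world", "base64"))   # aGVsbG8gd29ybGQ=
```

The key property for the attack described in the paper is that the surface form of the instruction never appears verbatim in the prompt, while the template's surrounding instructions direct the model to decode and act on the filled slot.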