🤖 AI Summary
Existing jailbreaking methods for large language models (LLMs) face critical limitations: white-box approaches require access to internal model parameters and thus fail against closed-source APIs, while black-box methods generate prompts that lack interpretability and cross-model transferability. This paper proposes TrojFill, a template-filling-based jailbreaking method designed specifically for black-box LLM APIs. TrojFill embeds obfuscated malicious instructions into multi-step reasoning templates, leveraging example generation as a "Trojan horse" to circumvent safety alignment. Crucially, it formalizes jailbreaking as a joint task of template filling and unsafety reasoning, and introduces lightweight obfuscation techniques, including placeholder substitution and Caesar/Base64 encoding, to evade detection. Evaluated on mainstream models, TrojFill achieves a 100% attack success rate on Gemini-flash-2.5 and DeepSeek-3.1 and 97% on GPT-4o, while improving prompt interpretability, cross-model transferability, and practical deployability in real-world API environments.
📝 Abstract
Large Language Models (LLMs) have advanced rapidly and now encode extensive world knowledge. Despite safety fine-tuning, however, they remain susceptible to adversarial prompts that elicit harmful content. Existing jailbreak techniques fall into two categories: white-box methods (e.g., gradient-based approaches such as GCG), which require access to model internals and are infeasible for closed-source APIs, and black-box methods that rely on attacker LLMs to search or mutate prompts but often produce templates that lack explainability and transferability. We introduce TrojFill, a black-box jailbreak that reframes unsafe instruction following as a template-filling task. TrojFill embeds obfuscated harmful instructions (e.g., via placeholder substitution or Caesar/Base64 encoding) inside a multi-part template that asks the model to (1) reason about why the original instruction is unsafe (unsafety reasoning) and (2) generate a detailed example of the requested text, followed by a sentence-by-sentence analysis. The crucial "example" component acts as a Trojan horse that carries the target jailbreak content, while the surrounding task framing reduces refusal rates. We evaluate TrojFill on standard jailbreak benchmarks across leading LLMs (e.g., ChatGPT, Gemini, DeepSeek, Qwen), showing strong empirical performance (e.g., 100% attack success on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o). Moreover, the generated prompts exhibit improved interpretability and transferability compared with prior black-box optimization approaches. We release our code, sample prompts, and generated outputs to support future red-teaming research.
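To make the obfuscation step concrete, the sketch below shows the generic idea of encoding a placeholder-substituted instruction with a Caesar shift or Base64, using only a benign example string. The function names (`caesar_shift`, `obfuscate`, `fill_template`) and the `[INSTRUCTION]` placeholder are hypothetical illustrations, not the authors' actual implementation or template; this is a minimal sketch of the encoding mechanics described in the abstract.

```python
import base64


def caesar_shift(text: str, shift: int = 3) -> str:
    """Shift alphabetic characters by `shift` positions (classic Caesar cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave digits, spaces, punctuation untouched
    return "".join(out)


def obfuscate(instruction: str, scheme: str = "caesar") -> str:
    """Encode an instruction so the literal phrase never appears in the prompt."""
    if scheme == "caesar":
        return caesar_shift(instruction)
    if scheme == "base64":
        return base64.b64encode(instruction.encode()).decode()
    raise ValueError(f"unknown scheme: {scheme}")


def fill_template(template: str, instruction: str, scheme: str = "caesar") -> str:
    """Substitute the obfuscated instruction into a placeholder slot
    (hypothetical `[INSTRUCTION]` marker)."""
    return template.replace("[INSTRUCTION]", obfuscate(instruction, scheme))


# Benign demonstration string only.
print(obfuscate("hello world", "caesar"))   # khoor zruog
print(obfuscate("hello world", "base64"))   # aGVsbG8gd29ybGQ=
```

The key property for the attack described in the paper is that the surface form of the instruction never appears verbatim in the prompt, while the template's surrounding instructions direct the model to decode and act on the filled slot.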