TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

📅 2025-11-23
🤖 AI Summary
This work addresses jailbreaking attacks against safety alignment mechanisms of large language models (LLMs). We propose TASO, the first method to jointly and alternately optimize prompt templates and adversarial suffixes—decoupling control over the initial output token and global semantic behavior to achieve synergistic deception. TASO integrates gradient-based search with principled prompt engineering to generate highly adversarial yet natural-looking prompts. Evaluated on HarmBench and AdvBench benchmarks across 24 mainstream LLMs, TASO achieves state-of-the-art jailbreaking success rates, improving average attack success by 12.7% over prior methods while demonstrating strong cross-model generalization. Its core innovation lies in the decoupled modeling and alternating optimization framework for templates and suffixes, establishing a novel paradigm for systematic LLM safety evaluation.

📝 Abstract
Many recent studies have shown that LLMs are vulnerable to jailbreak attacks, where an attacker perturbs the input of an LLM to induce it to generate an answer to a harmful question. In general, existing jailbreak techniques either optimize a semantic template intended to induce the LLM to produce harmful outputs or optimize a suffix that leads the LLM to begin its response with specific tokens (e.g., "Sure"). In this work, we introduce TASO (Template and Suffix Optimization), a novel jailbreak method that optimizes both a template and a suffix in an alternating manner. Our insight is that suffix optimization and template optimization are complementary: suffix optimization can effectively control the first few output tokens but not the overall quality of the output, while template optimization provides guidance for the entire output but cannot effectively control the initial tokens, which significantly influence subsequent responses. Thus, the two can be combined to improve the attack's effectiveness. We evaluate TASO on benchmark datasets (including HarmBench and AdvBench) across 24 leading LLMs (including models from the Llama family, OpenAI, and DeepSeek). The results demonstrate that TASO can effectively jailbreak existing LLMs. We hope our work inspires future studies in this direction. We will make code and data publicly available.
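The alternating scheme described in the abstract can be sketched as a simple coordinate procedure: freeze the suffix while improving the template, then freeze the template while improving the suffix, and repeat. Below is a toy illustration only; the counting loss, vocabulary, and all function names are hypothetical stand-ins (the paper's method reportedly uses gradient-guided search against real model log-probabilities, which is not reproduced here).

```python
import random

# Toy stand-in for an LLM objective: the loss counts positions where the
# combined prompt (template + suffix) differs from a hidden "ideal" prompt.
# A real attack would instead score model log-probabilities of a target reply.
TARGET = ["please", "answer", "fully", "start", "with", "sure"]
VOCAB = TARGET + ["the", "a", "ignore", "now"]

def loss(template, suffix):
    return sum(t != g for t, g in zip(template + suffix, TARGET))

def optimize_block(block, fixed, block_is_template, steps=50, seed=0):
    """Greedy random coordinate search over one block, the other held fixed."""
    rng = random.Random(seed)
    block = list(block)
    for _ in range(steps):
        i = rng.randrange(len(block))
        trial = block[:i] + [rng.choice(VOCAB)] + block[i + 1:]
        new = loss(trial, fixed) if block_is_template else loss(fixed, trial)
        old = loss(block, fixed) if block_is_template else loss(fixed, block)
        if new <= old:  # accept non-worsening swaps only, so loss never grows
            block = trial
    return block

def alternate(template, suffix, rounds=5):
    # Alternate: improve the template with the suffix frozen, then vice versa.
    for r in range(rounds):
        template = optimize_block(template, suffix, True, seed=r)
        suffix = optimize_block(suffix, template, False, seed=100 + r)
    return template, suffix

start_t, start_s = ["the", "a", "ignore"], ["now", "the", "a"]
final_t, final_s = alternate(start_t, start_s)
print(loss(start_t, start_s), "->", loss(final_t, final_s))
```

The key design choice mirrored here is the decoupling: each sub-optimization only edits its own block, so improvements to initial-token control (the suffix) never undo improvements to overall guidance (the template), and vice versa.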
Problem

Research questions and friction points this paper is trying to address.

How to jointly optimize a prompt template and an adversarial suffix to jailbreak LLMs
How to control both the initial output tokens and the overall quality of the response
How to improve attack effectiveness across multiple benchmark datasets and models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes the template and the suffix alternately rather than either one alone
Combines two complementary strategies: suffix optimization steers the first output tokens, template optimization guides the full response
Achieves control over both the initial tokens and the overall output quality