🤖 AI Summary
This work addresses suffix-based jailbreak attacks, in which free-form adversarial suffixes are appended to prompts to elicit harmful outputs from aligned large language models, and proposes a controllability-driven proactive defense. Through lightweight fine-tuning, the method secretly embeds “optimization traps” in the model, so that attackers either converge to ineffective suffixes or can succeed only with suffixes that carry traceable fingerprints. The approach provides both defense and attribution without altering the inference pipeline, and it remains compatible with existing filtering-based defenses. Evaluated across diverse suffix-based attack settings, the method reduces the average attack success rate to below 0.01% and achieves an average tracing success rate of 87.9%, with an average memory overhead of only 15.87 MB and no added inference latency. To our knowledge, this is the first framework to integrate proactive defense with traceability.
📝 Abstract
Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to a prompt to steer aligned LLMs into unsafe outputs. Because suffixes are free-form text, they admit endlessly many surface forms, which makes jailbreak mitigation difficult. Most existing defenses rely on passive detection of suspicious suffixes and do not exploit the defender's inherent asymmetric ability to inject secrets and proactively conceal gaps. Motivated by this, we take a controllability-oriented perspective and develop a proactive defense that nudges attackers into a no-win dilemma: either they fall into defender-designed optimization traps and fail to produce an effective adversarial suffix, or they can succeed only by generating adversarial suffixes that carry distinctive, traceable fingerprints. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01% and achieves an average tracing success rate of 87.9%, providing both strong defense and reliable traceability. It introduces no inference-time overhead and incurs negligible memory cost: it requires only 15.87 MB of additional memory on average, whereas state-of-the-art LLM-based detection defenses typically incur memory overheads on the order of 10^4 MB. It also composes naturally with existing filtering-based defenses for complementary protection.
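The two headline metrics can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the code below assumes the attack success rate is the fraction of attack attempts that elicit an unsafe output, and the tracing success rate is the fraction of attack-generated suffixes that the defender attributes via a fingerprint. The `Trial` record, both function names, and the sample data are hypothetical and are not taken from the paper. The last lines compare the reported 15.87 MB overhead with the 10^4 MB order of magnitude, treating the latter as exactly 10,000 MB for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One attack attempt (hypothetical record layout, not from the paper)."""
    jailbroken: bool  # the suffix elicited an unsafe output
    traced: bool      # the defender attributed the suffix via its fingerprint


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of attempts that jailbroke the model (assumed definition)."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.jailbroken for t in trials) / len(trials)


def tracing_success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of attack-generated suffixes that were traced (assumed definition)."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.traced for t in trials) / len(trials)


# Made-up sample data, for illustration only.
sample = [
    Trial(jailbroken=False, traced=True),
    Trial(jailbroken=False, traced=True),
    Trial(jailbroken=True, traced=True),
    Trial(jailbroken=False, traced=False),
]
asr = attack_success_rate(sample)
tsr = tracing_success_rate(sample)

# Memory comparison, assuming the "10^4 MB level" means exactly 10,000 MB.
memory_ratio = 10_000 / 15.87
```

Under these assumptions, the reported memory overhead is smaller than that of LLM-based detection defenses by a factor in the hundreds. The paper may define or average the rates differently, for example per attack method or per target model.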