Adversarial Suffix Filtering: a Defense Pipeline for LLMs

📅 2025-05-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Large language models (LLMs) are highly vulnerable to adversarial-suffix jailbreak attacks in both black-box and white-box settings. Method: This paper proposes a lightweight, model-agnostic input-preprocessing defense pipeline that operates without access to model parameters or architecture. It employs a multi-stage filtering mechanism comprising semantic-sensitive feature extraction, dynamic suffix boundary detection, and a lightweight classifier for real-time detection and blocking of adversarial suffixes. Contribution/Results: The work introduces the first model-agnostic, low-overhead, prompt-engineering-resistant adversarial suffix filtering paradigm. Experiments show that the success rate of mainstream adversarial suffix attacks drops from over 90% to under 4%, while degradation on benign tasks stays below 1%. Deployment overhead in both memory and computation is negligible, significantly outperforming existing defenses that require model access or heavy resource consumption.
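The paper itself does not publish pseudocode for the pipeline, but the three stages it names (feature extraction, suffix boundary detection, and a lightweight classification step) can be illustrated with a minimal sketch. The sketch below is an assumption-laden stand-in: it substitutes a crude character-level symbol-ratio feature for the paper's learned, semantic-sensitive features, and a simple threshold for its trained classifier; `ALLOWED`, `symbol_ratio`, `find_suffix_boundary`, and `sanitize` are all hypothetical names, not the authors' API.

```python
import string

# Crude "benign" alphabet; ASF's real features are learned, not hand-coded.
ALLOWED = set(string.ascii_letters + string.digits + ".,!?'\"-")

def symbol_ratio(token: str) -> float:
    """Fraction of characters outside the benign alphabet --
    a stand-in for the paper's semantic-sensitive features."""
    if not token:
        return 0.0
    return sum(c not in ALLOWED for c in token) / len(token)

def find_suffix_boundary(tokens: list[str], thresh: float = 0.5) -> int:
    """Dynamic boundary detection: walk backwards from the prompt's end
    while tokens look anomalous; stop at the first benign-looking token."""
    boundary = len(tokens)
    for i in range(len(tokens) - 1, -1, -1):
        if symbol_ratio(tokens[i]) >= thresh:  # threshold plays the
            boundary = i                       # lightweight classifier's role
        else:
            break
    return boundary

def sanitize(prompt: str) -> tuple[str, bool]:
    """Input preprocessor: return (filtered_prompt, was_flagged)."""
    tokens = prompt.split()
    b = find_suffix_boundary(tokens)
    return " ".join(tokens[:b]), b < len(tokens)
```

Because the filter runs entirely on the prompt text, it needs no access to the target model, which is what makes this style of defense model-agnostic; real adversarial suffixes often mix readable words with gibberish, so a deployed version would need learned features rather than a character heuristic.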

📝 Abstract
Large Language Models (LLMs) are increasingly embedded in autonomous systems and public-facing environments, yet they remain susceptible to jailbreak vulnerabilities that may undermine their security and trustworthiness. Adversarial suffixes are considered the current state-of-the-art jailbreak, consistently outperforming simpler methods and frequently succeeding even in black-box settings. Existing defenses rely on access to a model's internal architecture (limiting deployment options), dramatically increase memory and computation footprints, or can be bypassed with simple prompt engineering. We introduce Adversarial Suffix Filtering (ASF), a novel, lightweight, model-agnostic defensive pipeline designed to protect LLMs against adversarial suffix attacks. ASF functions as an input preprocessor and sanitizer that detects and filters adversarially crafted suffixes in prompts, effectively neutralizing malicious injections. We demonstrate that ASF provides comprehensive defense capabilities across both black-box and white-box attack settings, reducing the attack efficacy of state-of-the-art adversarial suffix generation methods to below 4%, while only minimally affecting the target model's capabilities in non-adversarial scenarios.
Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against adversarial suffix jailbreak attacks
Addressing limitations of existing model-specific defenses
Providing lightweight, model-agnostic protection without performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight model-agnostic defense pipeline
Detects and filters adversarial suffix attacks
Works in black-box and white-box settings