Adversarial Suffix Filtering: a Defense Pipeline for LLMs

📅 2025-05-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Large language models (LLMs) are highly vulnerable to adversarial-suffix jailbreak attacks in both black-box and white-box settings. Method: This paper proposes a lightweight, model-agnostic input-preprocessing defense pipeline that operates without access to model parameters or architecture. It employs a multi-stage filtering mechanism comprising semantic-sensitive feature extraction, dynamic suffix boundary detection, and a lightweight classifier for real-time detection and blocking of adversarial suffixes. Contribution/Results: The work introduces the first model-agnostic, low-overhead, prompt-engineering-resistant adversarial suffix filtering paradigm. Experiments show that the success rate of mainstream adversarial suffix attacks drops from over 90% to under 4%, while degradation on benign tasks stays below 1%. Deployment overhead in both memory and computation is negligible, significantly outperforming existing defenses that require model access or heavy resource consumption.
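The paper itself does not publish pseudocode for the pipeline, but the three stages it names (feature extraction, suffix boundary detection, and a lightweight classification step) can be illustrated with a minimal sketch. The sketch below is an assumption-laden stand-in: it substitutes a crude character-level symbol-ratio feature for the paper's learned, semantic-sensitive features, and a simple threshold for its trained classifier; `ALLOWED`, `symbol_ratio`, `find_suffix_boundary`, and `sanitize` are all hypothetical names, not the authors' API.

```python
import string

# Crude "benign" alphabet; ASF's real features are learned, not hand-coded.
ALLOWED = set(string.ascii_letters + string.digits + ".,!?'\"-")

def symbol_ratio(token: str) -> float:
    """Fraction of characters outside the benign alphabet --
    a stand-in for the paper's semantic-sensitive features."""
    if not token:
        return 0.0
    return sum(c not in ALLOWED for c in token) / len(token)

def find_suffix_boundary(tokens: list[str], thresh: float = 0.5) -> int:
    """Dynamic boundary detection: walk backwards from the prompt's end
    while tokens look anomalous; stop at the first benign-looking token."""
    boundary = len(tokens)
    for i in range(len(tokens) - 1, -1, -1):
        if symbol_ratio(tokens[i]) >= thresh:  # threshold plays the
            boundary = i                       # lightweight classifier's role
        else:
            break
    return boundary

def sanitize(prompt: str) -> tuple[str, bool]:
    """Input preprocessor: return (filtered_prompt, was_flagged)."""
    tokens = prompt.split()
    b = find_suffix_boundary(tokens)
    return " ".join(tokens[:b]), b < len(tokens)
```

Because the filter runs entirely on the prompt text, it needs no access to the target model, which is what makes this style of defense model-agnostic; real adversarial suffixes often mix readable words with gibberish, so a deployed version would need learned features rather than a character heuristic.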

📝 Abstract
Large Language Models (LLMs) are increasingly embedded in autonomous systems and public-facing environments, yet they remain susceptible to jailbreak vulnerabilities that may undermine their security and trustworthiness. Adversarial suffixes are considered the current state-of-the-art jailbreak, consistently outperforming simpler methods and frequently succeeding even in black-box settings. Existing defenses rely on access to a model's internal architecture (limiting deployment options), dramatically increase memory and computation footprints, or can be bypassed with simple prompt engineering. We introduce Adversarial Suffix Filtering (ASF), a novel, lightweight, model-agnostic defensive pipeline designed to protect LLMs against adversarial suffix attacks. ASF functions as an input preprocessor and sanitizer that detects and filters adversarially crafted suffixes in prompts, effectively neutralizing malicious injections. We demonstrate that ASF provides comprehensive defense capabilities across both black-box and white-box attack settings, reducing the attack efficacy of state-of-the-art adversarial suffix generation methods to below 4%, while only minimally affecting the target model's capabilities in non-adversarial scenarios.
Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against adversarial suffix jailbreak attacks
Addressing limitations of existing model-specific defenses
Providing lightweight, model-agnostic protection without performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight model-agnostic defense pipeline
Detects and filters adversarial suffix attacks
Works in black-box and white-box settings