MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

📅 2026-02-21
🤖 AI Summary
This work proposes a lightweight, inference-time defense against adversarial jailbreak attacks on large language models (LLMs). It targets the limitations of existing binary-classifier-based approaches: poor generalization, costly repeated fine-tuning, and potential degradation of model capabilities. The method brings a diffusion process to LLM safety: it learns the score function of benign hidden states, which amounts to density estimation over the manifold of benign representations, and uses it to project anomalous hidden states toward safe regions. It requires no harmful training examples and leaves the model architecture unchanged. Evaluated on Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it, it reduces attack success rate by up to 100% on certain datasets while preserving model utility on benign inputs.

📝 Abstract
Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.
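
The idea of following a benign score function toward safe regions can be illustrated with a toy sketch. This is not the paper's implementation: MANATEE learns the score with a diffusion model, whereas the sketch below substitutes the closed-form score of a Gaussian fitted to benign vectors. All function names, the distance threshold used to decide what counts as anomalous, the step size, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def fit_gaussian_score(benign, ridge=1e-3):
    """Fit a Gaussian to benign vectors; return its mean and precision matrix.

    Assumption: a single Gaussian stands in for the learned diffusion score model.
    """
    mu = benign.mean(axis=0)
    cov = np.cov(benign, rowvar=False) + ridge * np.eye(benign.shape[1])
    return mu, np.linalg.inv(cov)


def score(h, mu, precision):
    """Gradient of the Gaussian log-density at h: -Sigma^{-1} (h - mu)."""
    return -precision @ (h - mu)


def mahalanobis_sq(h, mu, precision):
    """Squared Mahalanobis distance, used here as a simple anomaly measure."""
    d = h - mu
    return float(d @ precision @ d)


def project(h, mu, precision, threshold, step=0.1, n_steps=50):
    """Move h along the score until it is inside the benign region.

    Vectors already within the threshold are returned unchanged
    (a hypothetical gating rule, not taken from the paper).
    The step size must be small relative to the precision eigenvalues.
    """
    h = h.copy()
    for _ in range(n_steps):
        if mahalanobis_sq(h, mu, precision) <= threshold:
            break
        h = h + step * score(h, mu, precision)
    return h


# Synthetic stand-ins for benign hidden states (8-dimensional for readability).
rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 8))
mu, precision = fit_gaussian_score(benign)

# Treat anything beyond the 95th percentile of benign distances as anomalous.
benign_dists = np.array([mahalanobis_sq(b, mu, precision) for b in benign])
threshold = float(np.quantile(benign_dists, 0.95))

anomalous = mu + 6.0 * np.ones(8)  # a vector far from the benign cloud
projected = project(anomalous, mu, precision, threshold)
print(mahalanobis_sq(anomalous, mu, precision), mahalanobis_sq(projected, mu, precision))
```

In this sketch the gate leaves in-distribution vectors untouched, which mirrors the stated goal of preserving behavior on benign inputs; how the actual method decides when and how far to project is not specified in the abstract.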

Problem

Research questions and friction points this paper is trying to address.

adversarial jailbreak attacks
LLM safety
inference-time defense
diffusion-based defense
representation manifold

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based defense
inference-time safety
density estimation
representation manifold
adversarial jailbreak