Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the trade-off between safety and utility of large language models (LLMs) under jailbreaking attacks, this paper proposes a dynamic sparse representation modulation method applied at inference time. We first empirically discover that safety-relevant information in LLMs is highly sparse—intervening in only ~5% of hidden states achieves safety control equivalent to full-state intervention. Our method constructs safety-direction vectors via directional latent-space projection and performs lightweight, inference-time redirection of critical hidden states—without modifying input tokens or introducing computational overhead—enabling continuous, fine-grained safety-utility balancing. Evaluated across nine models spanning 2B to 72B parameters, our approach robustly mitigates ten distinct jailbreaking attack categories, outperforming six state-of-the-art defenses while maintaining a benign query pass rate of ≥98.2%.

📝 Abstract
As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during reasoning, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely-deployed AI systems.
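The core mechanism described above — shifting a sparse subset of hidden-state dimensions along a safety direction at inference time — can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code: the mean-difference construction of the direction and the function names (`safety_direction`, `sparse_safety_shift`) are assumptions, and the paper's actual direction is built via directional latent-space projection.

```python
import numpy as np

def safety_direction(benign_states: np.ndarray, harmful_states: np.ndarray) -> np.ndarray:
    """Estimate a safety direction in hidden-state space.

    Illustrative mean-difference construction; the paper derives its
    direction via directional latent-space projection instead.
    """
    return benign_states.mean(axis=0) - harmful_states.mean(axis=0)

def sparse_safety_shift(hidden: np.ndarray, direction: np.ndarray,
                        alpha: float = 1.0, sparsity: float = 0.05) -> np.ndarray:
    """Shift `hidden` along `direction`, touching only the top `sparsity`
    fraction of dimensions ranked by |direction| (~5% in the paper).

    `alpha` is the adjustable strength that trades safety against utility.
    """
    k = max(1, int(sparsity * direction.size))
    top_dims = np.argsort(-np.abs(direction))[:k]   # indices of largest components
    mask = np.zeros_like(direction)
    mask[top_dims] = 1.0
    return hidden + alpha * direction * mask
```

In a real deployment this shift would be applied to selected transformer layers during decoding (e.g. via a forward hook), with `alpha` tuned at runtime to move along the safety-utility curve without any prompt modification or extra tokens.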
Problem

Research questions and friction points this paper is trying to address.

Ensuring safety and utility in LLMs
Mitigating jailbreak attacks efficiently
Real-time adjustment of safety preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time LLM safety preference adjustment
Sparse internal state manipulation
No additional token overhead
Guobin Shen
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Beijing Institute of AI Safety and Governance; Center for Long-term Artificial Intelligence; School of Future Technology, University of Chinese Academy of Sciences
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural Networks, Event-Based Vision, Brain-inspired AI, LLM Safety
Yiting Dong
Peking University, Institute of Automation, CAS
Brain-Inspired Intelligence, Spiking Neural Networks, Event-Based Vision, Large Language Models
Xiang He
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Beijing Institute of AI Safety and Governance; Center for Long-term Artificial Intelligence
Yi Zeng
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences; Beijing Institute of AI Safety and Governance; Center for Long-term Artificial Intelligence; School of Future Technology, University of Chinese Academy of Sciences