🤖 AI Summary
Vision-language models (VLMs) are vulnerable to jailbreak attacks, and existing defenses trade security against practicality. To address both problems, this paper proposes a lightweight, inference-time adaptive defense framework. Unlike weight-modification approaches, our method leverages singular value decomposition (SVD) to construct a low-dimensional "safe subspace" and, during inference, dynamically projects input steering vectors onto this subspace and reconstructs them, thereby adaptively suppressing malicious generation signals. The defense requires no fine-tuning and operates entirely within a single forward pass. Experiments demonstrate that our approach reduces jailbreak success rates by over 60%, improves accuracy on standard tasks by 1–2 percentage points, and incurs negligible computational overhead, significantly enhancing VLM security, inference efficiency, and real-world deployability.
📝 Abstract
As the capabilities of Vision Language Models (VLMs) continue to improve, they are increasingly targeted by jailbreak attacks. Existing defense methods face two major limitations: (1) they struggle to ensure safety without compromising the model's utility; and (2) many defense mechanisms significantly reduce the model's inference efficiency. To address these challenges, we propose SafeSteer, a lightweight, inference-time steering framework that effectively defends against diverse jailbreak attacks without modifying model weights. At the core of SafeSteer is the use of Singular Value Decomposition (SVD) to construct a low-dimensional "safety subspace." By projecting the raw steering vector onto this subspace and reconstructing it during inference, SafeSteer adaptively removes harmful generation signals while preserving the model's ability to handle benign inputs. The entire process is executed in a single inference pass, introducing negligible overhead. Extensive experiments show that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1–2%, without introducing significant inference latency. These results demonstrate that robust and practical jailbreak defense can be achieved through simple, efficient inference-time control.
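To make the core mechanism concrete, the sketch below illustrates the general idea of an SVD-based subspace projection applied to a steering vector. All names, shapes, and the use of random stand-in data are assumptions for illustration; the paper's actual implementation, calibration data, and steering procedure are not shown here.

```python
import numpy as np

# Hypothetical sketch of an SVD "safety subspace" projection.
# Shapes and data are illustrative assumptions, not the paper's code.
d, n, k = 64, 32, 8  # hidden size, calibration vectors, subspace rank
rng = np.random.default_rng(0)

# Calibration matrix: one candidate safety-steering vector per column
# (e.g., hidden-state differences between harmful and benign prompts;
# random stand-ins here).
S = rng.standard_normal((d, n))

# The top-k left singular vectors span the low-dimensional subspace.
U, _, _ = np.linalg.svd(S, full_matrices=False)
U_k = U[:, :k]  # (d, k) orthonormal basis

def steer(hidden, raw_vector, alpha=1.0):
    """Project the raw steering vector onto the subspace, reconstruct
    it, and add it to the hidden state in a single forward pass."""
    v_safe = U_k @ (U_k.T @ raw_vector)  # project + reconstruct
    return hidden + alpha * v_safe

h = rng.standard_normal(d)   # a hidden state at some layer
v = rng.standard_normal(d)   # a raw steering vector
h_steered = steer(h, v)
```

Because `U_k` is orthonormal, `U_k @ U_k.T` is an orthogonal projector: components of the raw vector outside the subspace are discarded, which is what allows the reconstruction to keep only the directions identified as safety-relevant.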