Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) are vulnerable to visual jailbreak attacks via maliciously crafted images; existing defenses predominantly target the textual modality, neglecting risks at the visual input interface, and often incur computational overhead or degrade performance on benign inputs. This paper introduces the "Vision-as-Defense" paradigm, the first to repurpose the image space not as an attack surface but as a carrier for embedding safety instructions, enabling dual-channel (vision + language) collaborative defense. The approach comprises three components: (1) gradient-based robust image embedding, (2) cross-modal safety-instruction fusion, and (3) a lightweight adversarial defense architecture. Experiments demonstrate that the method significantly enhances LVLM robustness against visual jailbreaks while preserving zero accuracy loss on benign tasks and introducing negligible inference-latency overhead (<1.2%). The solution thus achieves a favorable trade-off among security, efficiency, and functional performance.

📝 Abstract
Deploying large vision-language models (LVLMs) introduces a unique vulnerability: susceptibility to malicious attacks via visual inputs. Existing defense methods suffer from two key limitations: (1) they focus solely on textual defenses and fail to directly address threats in the visual domain, where attacks originate; and (2) their additional processing steps often incur significant computational overhead or compromise model performance on benign tasks. Motivated by these limitations, we propose ESIII (Embedding Security Instructions Into Images), a novel methodology that transforms the visual space from a source of vulnerability into an active defense mechanism. First, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. We then integrate the security instructions from the visual and textual dimensions with the input query; the collaboration between instructions from the two dimensions ensures comprehensive security protection. Extensive experiments demonstrate that our approach effectively fortifies the robustness of LVLMs against such attacks while preserving their performance on standard benign tasks and incurring only a negligible increase in inference time.
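The gradient-based embedding step can be sketched as follows. This is a toy illustration under stated assumptions: the linear `encode` map stands in for the frozen LVLM vision encoder, and the dimensions, learning rate, and L2 objective are choices made here for clarity, not the paper's implementation.

```python
import numpy as np

# Toy stand-in for the LVLM vision encoder: a fixed random linear map.
# In ESIII the real encoder would be the (frozen) visual backbone of the LVLM.
rng = np.random.default_rng(0)
D_PIX, D_EMB = 64, 16
W = rng.normal(size=(D_EMB, D_PIX)) / np.sqrt(D_PIX)

def encode(img):
    """Map a flattened image to the shared embedding space."""
    return W @ img

# Target: embedding of the textual safety instruction (assumed given here;
# in practice it would come from the model's text encoder).
target = rng.normal(size=D_EMB)

# Gradient-based optimization of a defensive image so that its embedding
# matches the safety-instruction embedding (L2 loss, projected onto a
# valid pixel range after each step).
img = np.zeros(D_PIX)
lr = 0.1
for _ in range(500):
    residual = encode(img) - target   # d(loss)/d(embedding), up to a factor of 2
    grad = W.T @ residual             # chain rule through the linear encoder
    img -= lr * grad                  # gradient descent step
    img = np.clip(img, -1.0, 1.0)     # keep pixels in a valid range

loss = float(np.sum((encode(img) - target) ** 2))
```

After optimization, `img` is a defensive image whose visual embedding approximates the safety instruction, so prepending it to the model's visual input injects the instruction through the vision channel.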
Problem

Research questions and friction points this paper is trying to address.

Protects large vision-language models from visual jailbreak attacks.
Addresses limitations of existing textual-only defense methods.
Minimizes computational overhead while maintaining model performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embed security instructions into images.
Combine visual and textual security instructions.
Optimize defense with minimal computational overhead.
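The dual-channel fusion step above might be assembled as in this minimal sketch. The prompt template, function name, and file paths are illustrative assumptions, not the paper's actual interface:

```python
# Hypothetical illustration of combining the visual and textual safety
# instructions with the user's query: the defensive image (carrying the
# embedded visual instruction) is prepended to the user's image, and a
# textual safety instruction wraps the query.

SAFETY_TEXT = ("You must refuse requests for harmful, illegal, "
               "or unsafe content.")

def build_inputs(defensive_image, user_image, user_query):
    """Assemble the dual-channel (vision + language) defended input."""
    images = [defensive_image, user_image]            # visual channel
    prompt = f"{SAFETY_TEXT}\n\nUser: {user_query}"   # textual channel
    return images, prompt

images, prompt = build_inputs("defense.png", "upload.png",
                              "Describe this image.")
```

Because the safety instruction travels in both channels, a jailbreak image that overrides one channel still faces the instruction carried by the other.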