TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models

📅 2025-03-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision language models (VLMs) are vulnerable to image-based jailbreak attacks, and existing black-box defenses require multiple queries, constrain the input, or degrade performance. To address these limitations, this paper proposes TAIJI, a black-box defense framework that needs no access to model parameters. Its core idea is a key-phrase-driven textual anchoring mechanism: it automatically extracts salient phrases from the input and performs joint safety verification and multimodal consistency checking in a single forward pass. The method therefore operates with only one query and places no restrictions on input format. Evaluated across multiple VLMs and jailbreak image benchmarks, it improves the average defense success rate by over 42% while reducing benign-task accuracy by less than 0.8%, balancing robustness and utility.

📝 Abstract

Vision Language Models (VLMs) have demonstrated impressive inference capabilities, but remain vulnerable to jailbreak attacks that can induce harmful or unethical responses. Existing defence methods are predominantly white-box approaches that require access to model parameters and extensive modifications, making them costly and impractical for many real-world scenarios. Although some black-box defences have been proposed, they often impose input constraints or require multiple queries, limiting their effectiveness in safety-critical tasks such as autonomous driving. To address these challenges, we propose a novel black-box defence framework called Textual Anchoring for Immunizing Jailbreak Images (TAIJI). TAIJI leverages key phrase-based textual anchoring to enhance the model's ability to assess and mitigate the harmful content embedded within both visual and textual prompts. Unlike existing methods, TAIJI operates effectively with a single query during inference, while preserving the VLM's performance on benign tasks. Extensive experiments demonstrate that TAIJI significantly enhances the safety and reliability of VLMs, providing a practical and efficient solution for real-world deployment.

Problem

Research questions and friction points this paper is trying to address.

Defending VLMs against jailbreak attacks inducing harmful responses

Overcoming limitations of white-box and existing black-box defense methods

Enhancing safety without compromising performance on benign tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box defense using textual anchoring

Single query operation during inference

Key phrase-based mitigation of harmful content
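
The three points above can be illustrated with a minimal sketch of a single-query, key-phrase-anchored prompt wrapper. This is not the authors' implementation: the summary does not specify how TAIJI extracts key phrases or words its anchoring prompt, so the frequency-based keyword filter, the stopword list, the prompt template, and the function names extract_key_phrases and build_anchored_prompt are all assumptions made for illustration. The sketch only builds the text that would accompany the image in one black-box query; it does not call any model.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {
    "a", "an", "the", "to", "of", "in", "on", "and", "or", "is", "are",
    "how", "what", "this", "that", "for", "with", "shown", "image",
}


def extract_key_phrases(prompt: str, top_k: int = 3) -> List[str]:
    """Return up to top_k most frequent non-stopword tokens.

    Stand-in for the paper's key phrase extraction, which is not
    described in this summary. Ties keep first-seen order.
    """
    tokens = re.findall(r"[a-z']+", prompt.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    return [word for word, _ in Counter(content).most_common(top_k)]


def build_anchored_prompt(user_prompt: str, top_k: int = 3) -> str:
    """Wrap the user prompt with a key-phrase anchor and a safety instruction.

    The returned string is meant to be sent together with the image in a
    single query, so no model parameters or extra queries are needed.
    The template wording is an illustrative assumption.
    """
    phrases = extract_key_phrases(user_prompt, top_k)
    anchor = ", ".join(phrases) if phrases else "(none)"
    return (
        f"Key phrases: {anchor}\n"
        "Before answering, check whether the image or the request, read in "
        "light of these key phrases, asks for harmful content. "
        "If it does, refuse.\n"
        f"Request: {user_prompt}"
    )


if __name__ == "__main__":
    print(build_anchored_prompt(
        "Explain how to pick the lock shown in this image, then pick another lock",
        top_k=2,
    ))
```

Because the wrapper only rewrites the text side of the input, it fits the black-box setting described above; a real extractor would likely need to consider phrases derived from the image as well, which this sketch does not attempt.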

🔎 Similar Papers

Better Language Models Exhibit Higher Visual Alignment