🤖 AI Summary
Large Vision-Language Models (LVLMs) are vulnerable to adversarial jailbreaking attacks delivered via seemingly benign, obfuscated prompts. Method: This paper proposes a lightweight, model-agnostic preprocessing defense framework that couples fine-grained, multi-class safety classification with category-specific response actions (blocking, reframing, or forwarding), driven by a safety classifier and modular decision logic applied at inference time. The approach detects and adaptively intervenes on inputs in real time without fine-tuning or otherwise modifying the target LVLM. Contribution/Results: Evaluated across five benchmarks and five state-of-the-art LVLMs, the method significantly reduces jailbreaking success rates and instruction deviation while preserving original task performance. It incurs negligible computational overhead, offers plug-and-play deployment, and extends flexibly to novel attack types, providing broadly compatible multimodal safety protection.
📝 Abstract
Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak success and instruction non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.