🤖 AI Summary
This work identifies a critical security vulnerability in large vision-language models (LVLMs) arising from their semantic slot-filling behavior, which attackers can exploit to induce harmful content generation. To expose this previously unexamined risk, the authors propose StructAttack, a black-box, single-query jailbreak framework. The approach decomposes malicious intent into a benign topic and multiple harmless slot types, embedding them within structured visual prompts (such as mind maps) that subtly guide the model toward reconstructing coherent harmful outputs without triggering safety safeguards. Extensive experiments across multiple state-of-the-art LVLMs and standard benchmarks demonstrate the effectiveness of StructAttack, showing that it consistently bypasses existing defense mechanisms while maintaining high output coherence.
📝 Abstract
Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose StructAttack, a simple yet effective single-query jailbreak framework under black-box settings. StructAttack decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. When paired with a completion-guided instruction, these prompts lead LVLMs to automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), StructAttack exploits LVLMs' reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models and benchmarks show the efficacy of our proposed StructAttack.