AI Summary
This work addresses the security threat posed by visual prompt injection attacks, which exploit adversarial images to compromise multimodal large language models. Existing defenses struggle to balance efficacy and efficiency due to insufficient understanding of the underlying attack mechanisms. The study reveals that such attacks rely on only a few critical image tokens and introduces a scoring mechanism based on hidden-state gradient norm to identify them. This approach overcomes the limitation of conventional attribution methods, which fail when model predictions remain unchanged, and provides theoretical guarantees for localization accuracy. Requiring just a single forward-backward pass, the method efficiently masks critical tokens, reducing attack success rates to near zero across diverse visual prompt injection and multimodal jailbreaking scenarios, while preserving normal model performance with negligible computational overhead.
Abstract
Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying mechanisms and struggle to balance efficiency and defense utility. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens. Based on this insight, we propose Gradient Token Masking (GTM), which localizes these tokens via gradient analysis and neutralizes them through masking. We find that attribution based on the first generated token's output probability fails when attacks preserve the predicted token. To overcome this, GTM utilizes the Hidden-State Gradient Norm score for generation-influence attribution under adversarial inputs. We prove that its ranking is consistent with that of the full adversarial loss gradient, providing a theoretical guarantee for accurate localization. Our method requires only a single forward-backward pass to identify and zero out a small number of high-scoring tokens, effectively disrupting the adversarial attack path. Extensive experiments on prompt injection and multimodal jailbreak attacks demonstrate that our approach reduces attack success rates (ASR) to near zero while preserving model utility with negligible computational overhead.
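The score-then-mask step described above can be illustrated with a minimal sketch. The abstract does not give the exact definition of the Hidden-State Gradient Norm score, the layer it is taken from, or the number of masked tokens, so everything here is an assumption: the per-token gradients are taken as already computed by the single forward-backward pass, the score is assumed to be the L2 norm of each image token's gradient vector, and the function names and the parameter `k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def gradient_norm_scores(token_grads: Sequence[Vector]) -> List[float]:
    """Assumed score: L2 norm of the gradient attached to each image token."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in token_grads]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in ascending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def mask_tokens(token_embeds: Sequence[Vector], indices: Sequence[int]) -> List[List[float]]:
    """Zero out the selected tokens and copy the others unchanged."""
    selected = set(indices)
    return [
        [0.0] * len(vec) if i in selected else list(vec)
        for i, vec in enumerate(token_embeds)
    ]


def gtm_mask(
    token_embeds: Sequence[Vector],
    token_grads: Sequence[Vector],
    k: int,
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical end-to-end step: score tokens, pick the top k, zero them out."""
    scores = gradient_norm_scores(token_grads)
    chosen = top_k_indices(scores, k)
    return mask_tokens(token_embeds, chosen), chosen


if __name__ == "__main__":
    # Toy values only; real gradients would come from the model's backward pass.
    embeds = [[0.2, 0.1], [0.9, -0.4], [0.3, 0.3], [-0.7, 0.8]]
    grads = [[0.01, 0.02], [3.0, 4.0], [0.1, 0.0], [0.6, 0.8]]
    masked, chosen = gtm_mask(embeds, grads, k=2)
    print(chosen)
    print(masked)
```

In this toy input, tokens 1 and 3 carry the largest gradient norms, so they are the ones zeroed out while the remaining tokens pass through untouched. In a real pipeline the gradients would be obtained with an automatic differentiation framework rather than supplied by hand.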