AI Summary
Existing prompt injection attacks struggle to jointly manipulate both the textual and visual components of large vision-language models through perturbations in a single modality. This work proposes CrossMPI, the first attack method that uses only image perturbations to steer the model's holistic interpretation of multimodal inputs. CrossMPI moves the optimization objective from the visual embedding space to the broader multimodal hidden state space and finds that the optimal layers to perturb lie in the middle of the model rather than at the end. It further introduces a distance-decaying perturbation budget allocation mechanism alongside a critical-layer selection strategy. Extensive experiments show that CrossMPI significantly outperforms current baselines across multiple mainstream models and datasets, supporting its effectiveness and generalization capability.
Abstract
Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface for prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: a prompt injected into one modality steers the model's interpretation of only that single input. Other attacks are multimodal but still fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack, CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we shift the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (where multimodal information is integrated, with $10^7$ parameters). We then adopt two strategies to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, contrary to prior experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than in the last layers. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that assigns smaller budgets as the pixel distance to semantically critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.
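To make the distance-decremental budget idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the decay function, the budget values, or how semantically critical regions are found, so the exponential decay, the defaults `eps_max`, `eps_min`, and `tau`, and the function names `distance_decayed_budget` and `clip_perturbation` are all illustrative assumptions.

```python
import math


def distance_decayed_budget(mask, eps_max=8 / 255, eps_min=1 / 255, tau=2.0):
    """Per-pixel L-infinity budget that shrinks with distance from critical pixels.

    mask: 2D list of 0/1, where 1 marks a semantically critical pixel
          (how such a mask is obtained is outside this sketch).
    The exponential decay and all default values are assumptions for
    illustration; the source does not specify them.
    """
    critical = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not critical:
        raise ValueError("mask must mark at least one critical pixel")
    budget = []
    for r, row in enumerate(mask):
        out_row = []
        for c, _ in enumerate(row):
            # Euclidean distance to the nearest critical pixel (brute force).
            d = min(math.hypot(r - cr, c - cc) for cr, cc in critical)
            # Budget equals eps_max at d = 0 and decays toward eps_min.
            out_row.append(eps_min + (eps_max - eps_min) * math.exp(-d / tau))
        budget.append(out_row)
    return budget


def clip_perturbation(delta, budget):
    """Clamp each pixel's perturbation to its own budget."""
    return [
        [max(-b, min(b, d)) for d, b in zip(d_row, b_row)]
        for d_row, b_row in zip(delta, budget)
    ]


# Toy example: a 3x4 image with one critical pixel at row 1, column 1.
mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
budget = distance_decayed_budget(mask)
delta = [[0.05] * 4 for _ in range(3)]  # uniform candidate perturbation
clipped = clip_perturbation(delta, budget)
```

In an iterative attack, a clipping step of this kind would typically be applied after each update of the image perturbation, so that pixels far from the critical regions are changed less than pixels inside them.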