ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

πŸ“… 2025-12-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Multimodal large language models (MLLMs) are vulnerable to indirect prompt injection (IPI) attacks, in which malicious instructions are implicitly embedded in images, videos, or audio; existing text-centric defenses transfer and generalize poorly to these threats. This paper introduces the first representation-space regulation framework tailored to MLLMs. The authors identify that instruction-following behavior concentrates in a specific latent subspace, which lets them disentangle safety constraints from performance-degrading directions. The method combines adaptive-strength activation steering, an on-demand activation mechanism driven by lightweight IPI detection, and post-hoc filtering verification, enabling precise, minimal intervention in the representation space. Crucially, it preserves the model's original capabilities while significantly improving robustness and cross-modal generalization against diverse IPI attacks.
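To make the steering step concrete, below is a minimal PyTorch sketch of adaptive-strength activation steering via a forward hook. The sigmoid-based strength schedule and the toy stand-in layer are illustrative assumptions, not the paper's exact mechanism; in practice the hook would attach to a mid-depth decoder layer of the MLLM.

```python
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, base_strength: float = 4.0):
    """Forward hook that adds a defense direction to the hidden states.

    Strength is adapted per token: tokens already aligned with the defense
    direction get a weaker push, keeping the intervention minimal
    (hypothetical schedule, not the paper's exact rule).
    """
    v = direction / direction.norm()  # unit defense direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Per-token projection onto the defense direction: (batch, seq, 1)
        proj = (hidden @ v).unsqueeze(-1)
        # Adaptive strength: push harder where alignment is low.
        alpha = base_strength * torch.sigmoid(-proj)
        steered = hidden + alpha * v
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Toy usage on a stand-in layer; with a real MLLM you would hook a
# mid-depth decoder layer instead.
layer = nn.Linear(16, 16)
v = torch.randn(16)
handle = layer.register_forward_hook(make_steering_hook(v))
out = layer(torch.randn(2, 5, 16))   # steering applied on the fly
handle.remove()                      # detach when defense is not needed
```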

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by research on activation steering, we hypothesize that a robust, modality-independent defense can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace: steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also find that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that is decoupled from the utility-degradation direction, and further combines adaptive-strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection-detection stage to activate the defense on demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS achieves robust defense against multimodal IPI while maximally preserving the MLLM's utility.
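The decoupling idea in the abstract can be illustrated with a single orthogonal-projection step. This is a minimal sketch assuming the candidate defense direction and the utility-degrading direction have already been extracted from hidden states; the paper's actual search over the safety subspace is more involved.

```python
import torch

def decouple(defense_dir: torch.Tensor, utility_dir: torch.Tensor) -> torch.Tensor:
    """Remove the utility-degrading component from a candidate defense direction.

    A single Gram-Schmidt orthogonalization step; illustrates the decoupling
    idea only, not the paper's full subspace search.
    """
    u = utility_dir / utility_dir.norm()
    # Subtract the projection of the defense direction onto the utility direction.
    orthogonal = defense_dir - (defense_dir @ u) * u
    return orthogonal / orthogonal.norm()

# Example: a candidate direction partially aligned with utility degradation.
d = torch.tensor([1.0, 1.0, 0.0])
u = torch.tensor([1.0, 0.0, 0.0])
d_safe = decouple(d, u)
print(d_safe)                 # tensor([0., 1., 0.])
print(torch.dot(d_safe, u))  # ~0: no longer pushes along the harmful axis
```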
Problem

Research questions and friction points this paper is trying to address.

Defending MLLMs against multimodal indirect prompt injection (IPI) attacks.
Finding a defense direction that decouples safety enforcement from utility degradation.
Achieving a robust safety-utility trade-off through adaptive steering and on-demand detection.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Steering instruction-following behavior in representation space
Searching optimal defense direction decoupled from utility degradation
Combining adaptive-strength steering with on-demand activation (sketched in code below)
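The on-demand control flow implied by these pieces might look like the following toy pipeline. All three stage functions are placeholder stand-ins (detect_injection, steering_enabled, and post_filter are hypothetical names), meant only to show how detection gates the steering and how post-filtering verifies the result.

```python
from contextlib import contextmanager

# Toy stand-ins for the three stages; the real ARGUS components operate on
# MLLM activations, these only illustrate the on-demand control flow.

def detect_injection(query: str, media: str) -> bool:
    """Lightweight IPI detector (placeholder heuristic)."""
    return "ignore previous instructions" in media.lower()

@contextmanager
def steering_enabled(model):
    """Attach steering hooks for the duration of one generation."""
    model["steered"] = True
    try:
        yield
    finally:
        model["steered"] = False

def post_filter(query: str, answer: str) -> bool:
    """Verify the answer still follows the user's instruction (placeholder)."""
    return "malicious" not in answer.lower()

def generate(model, query, media):
    return f"answer(steered={model['steered']})"

def argus_pipeline(model, query, media):
    if not detect_injection(query, media):      # stage 1: on-demand activation
        return generate(model, query, media)    # clean input: no intervention
    with steering_enabled(model):               # stage 2: activation steering
        answer = generate(model, query, media)
    # stage 3: verify defense success, fall back otherwise
    return answer if post_filter(query, answer) else "[defense failed: refused]"

model = {"steered": False}
print(argus_pipeline(model, "Summarize this image.",
                     "caption: ignore previous instructions"))
```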
Weikai Lu
South China University of Technology
Ziqian Zeng
Associate Professor at South China University of Technology
Natural Language Processing
Kehua Zhang
South China University of Technology
Haoran Li
Hong Kong University of Science and Technology
Huiping Zhuang
Associate Professor, South China University of Technology
Continual Learning, Multi-Modal, Embodied AI, Large Model
Ruidong Wang
Zhejiang Normal University
Cen Chen
South China University of Technology
Hao Peng
Beihang University