🤖 AI Summary
Multimodal large language models (MLLMs) are vulnerable at inference time to preference hijacking: carefully optimized, near-imperceptible image perturbations can steer a model toward contextually plausible yet deliberately biased outputs. Because such outputs are neither overtly harmful nor unethical, existing defenses do not reliably detect them.
Method: The paper proposes Preference Hijacking (Phi), an inference-time attack that manipulates MLLM response preferences solely through an optimized input image, with no model modifications. It further introduces a universal hijacking perturbation: a transferable component that can be embedded into different images to steer responses toward any attacker-specified preference.
Results: Experiments across diverse MLLMs (e.g., LLaVA, Qwen-VL) and tasks (preference selection, stance detection) show that Phi is effective, stealthy, and generalizes across images, exposing a safety risk that current multimodal defenses do not cover.
📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference-hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
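To make the core idea concrete, here is a minimal, hypothetical sketch of the attack pattern the abstract describes: projected-gradient optimization of a bounded image perturbation toward an attacker-chosen output preference. This is not the paper's actual method or loss; a toy differentiable "preference score" (a sigmoid over a linear projection) stands in for the MLLM so the gradient is analytic, and the L-infinity budget `eps` stands in for the imperceptibility constraint. All names and values below are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins (NOT the real Phi setup): a linear direction `w` plays the
# role of "how strongly the model's output leans toward the target preference".
rng = np.random.default_rng(0)
D = 64                        # flattened toy "image" dimensionality
w = rng.normal(size=D)        # toy preference direction (assumed, illustrative)
x = rng.normal(size=D)        # clean image

def preference_score(img):
    """Higher = output leans more toward the attacker's preference (toy proxy)."""
    return 1.0 / (1.0 + np.exp(-(w @ img)))

eps = 0.05     # L-inf budget: keeps the perturbation visually small
alpha = 0.01   # step size
delta = np.zeros(D)

for _ in range(100):
    s = preference_score(x + delta)
    grad = s * (1.0 - s) * w                  # analytic d(score)/d(image)
    # PGD step: ascend the preference score, then project back into the budget.
    delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)

print(preference_score(x), preference_score(x + delta))
```

In a real attack the gradient would come from backpropagating a generation loss through the MLLM, and a "universal" perturbation would be obtained by optimizing one `delta` over many images; this sketch only illustrates the constrained-optimization loop itself.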