🤖 AI Summary
Multimodal large language models (MLLMs) are vulnerable at inference time to preference hijacking: carefully optimized, near-imperceptible image perturbations can steer a model toward contextually plausible yet deliberately biased outputs. Because such outputs are neither overtly harmful nor unethical, existing defenses do not reliably detect them.
Method: The paper proposes Preference Hijacking (Phi), an inference-time attack that manipulates MLLM response preferences solely through an optimized input image, with no model modifications. It further introduces a universal hijacking perturbation: a transferable component that can be embedded into different images to steer responses toward any attacker-specified preference.
Results: Experiments across diverse MLLMs (e.g., LLaVA, Qwen-VL) and tasks (preference selection, stance detection) show that Phi is effective, stealthy, and generalizes across images, exposing a safety risk that current multimodal defenses do not cover.
📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference-hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
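To make the core idea concrete, here is a minimal, hypothetical sketch of the attack pattern the abstract describes: projected-gradient optimization of a bounded image perturbation toward an attacker-chosen output preference. This is not the paper's actual method or loss; a toy differentiable "preference score" (a sigmoid over a linear projection) stands in for the MLLM so the gradient is analytic, and the L-infinity budget `eps` stands in for the imperceptibility constraint. All names and values below are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins (NOT the real Phi setup): a linear direction `w` plays the
# role of "how strongly the model's output leans toward the target preference".
rng = np.random.default_rng(0)
D = 64                        # flattened toy "image" dimensionality
w = rng.normal(size=D)        # toy preference direction (assumed, illustrative)
x = rng.normal(size=D)        # clean image

def preference_score(img):
    """Higher = output leans more toward the attacker's preference (toy proxy)."""
    return 1.0 / (1.0 + np.exp(-(w @ img)))

eps = 0.05     # L-inf budget: keeps the perturbation visually small
alpha = 0.01   # step size
delta = np.zeros(D)

for _ in range(100):
    s = preference_score(x + delta)
    grad = s * (1.0 - s) * w                  # analytic d(score)/d(image)
    # PGD step: ascend the preference score, then project back into the budget.
    delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)

print(preference_score(x), preference_score(x + delta))
```

In a real attack the gradient would come from backpropagating a generation loss through the MLLM, and a "universal" perturbation would be obtained by optimizing one `delta` over many images; this sketch only illustrates the constrained-optimization loop itself.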