🤖 AI Summary
This work proposes the first cross-modal collaborative ranking attack framework against vision-language models (VLMs) in product search, addressing the challenge of achieving high attack effectiveness and strong imperceptibility simultaneously in the multimodal setting. By jointly optimizing visually imperceptible image perturbations and semantically natural textual suffixes, the method exploits the intrinsic coupling between modalities inside the VLM to manipulate ranking outcomes. An alternating gradient strategy enables efficient end-to-end generation of adversarial examples directly within the VLM architecture. Experimental results demonstrate that the proposed approach significantly outperforms unimodal baseline attacks at elevating the ranking of a target product while evading standard content filters and detectors.
📝 Abstract
Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: susceptibility to multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.
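To make the alternating scheme concrete, here is a minimal, self-contained sketch of the kind of loop the abstract describes; it is not the authors' implementation. A PGD-style phase updates an L∞-bounded image perturbation while the text suffix is held fixed, then a HotFlip-style greedy swap updates one suffix token while the image is held fixed. The toy encoders, the `rank_score` objective, and all hyperparameters (`EPS`, `ROUNDS`, etc.) are illustrative assumptions standing in for the real VLM ranker.

```python
# Hedged sketch of an alternating multimodal ranking attack, NOT the paper's
# code: toy encoders and hyperparameters are assumptions for illustration.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

EMB_DIM, VOCAB, SUFFIX_LEN = 64, 1000, 8
EPS, ALPHA, PGD_STEPS, ROUNDS = 8 / 255, 2 / 255, 10, 5

# Stand-ins for the VLM's two towers; in the paper's setting these would be
# the frozen image and text encoders of the target retrieval model.
img_encoder = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, EMB_DIM)
)
tok_embedding = torch.nn.Embedding(VOCAB, EMB_DIM)

def text_encoder(tok_embs):
    # Mean-pooled token embeddings as a minimal "text tower".
    return F.normalize(tok_embs.mean(dim=1), dim=-1)

def rank_score(query_emb, image, tok_embs):
    # Query-product similarity: a higher score ranks the product higher.
    fused = F.normalize(img_encoder(image) + text_encoder(tok_embs), dim=-1)
    return (query_emb * fused).sum()

query_emb = F.normalize(torch.randn(1, EMB_DIM), dim=-1)  # shopper's query
clean_image = torch.rand(1, 3, 32, 32)                    # target product photo
suffix_ids = torch.randint(0, VOCAB, (1, SUFFIX_LEN))     # adversarial text suffix

delta = torch.zeros_like(clean_image)
for _ in range(ROUNDS):
    # Image phase: PGD on an L_inf-bounded, visually imperceptible perturbation,
    # with the current suffix held fixed.
    for _ in range(PGD_STEPS):
        delta.requires_grad_(True)
        score = rank_score(query_emb, clean_image + delta, tok_embedding(suffix_ids))
        (grad,) = torch.autograd.grad(score, delta)
        with torch.no_grad():
            delta = (delta + ALPHA * grad.sign()).clamp(-EPS, EPS)
            delta = (clean_image + delta).clamp(0, 1) - clean_image  # valid pixels

    # Text phase: one HotFlip-style greedy token swap, guided by the gradient of
    # the score w.r.t. a one-hot relaxation of the suffix, image held fixed.
    one_hot = F.one_hot(suffix_ids, VOCAB).float().requires_grad_(True)
    score = rank_score(query_emb, clean_image + delta, one_hot @ tok_embedding.weight)
    (grad,) = torch.autograd.grad(score, one_hot)
    grads = grad[0]                          # (SUFFIX_LEN, VOCAB)
    pos = grads.max(dim=-1).values.argmax()  # most promising suffix position
    suffix_ids[0, pos] = grads[pos].argmax() # first-order best replacement token

final = rank_score(query_emb, clean_image + delta, tok_embedding(suffix_ids))
print(f"ranking score after attack: {final.item():.4f}")
```

The alternation is the point: each modality's update is computed against the other's current adversarial state, so the attack exploits the cross-modal coupling the abstract highlights rather than optimizing the two channels independently. A fluency constraint on the suffix (e.g., a language-model perplexity penalty in the text phase) would be needed on top of this sketch to match the "semantically natural" property the paper claims.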