AUV-Fusion: Cross-Modal Adversarial Fusion of User Interactions and Visual Perturbations Against VARS

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual-aware recommendation systems (VARS) suffer from insufficient robustness against adversarial attacks: fake-user injection incurs high cost and is easily detectable, while purely visual perturbations struggle to model user preferences effectively, limiting both stealthiness and attack efficacy. To address this, we propose AUV-Fusion—the first cross-modal adversarial attack framework tailored for VARS. Unlike prior approaches, AUV-Fusion avoids injecting synthetic users; instead, it jointly optimizes higher-order multi-hop user interaction modeling and vision–semantics alignment to generate semantically coherent and preference-consistent cross-modal adversarial examples within the latent spaces of pretrained VAEs and diffusion models. Evaluated across multiple VARS architectures and real-world datasets, AUV-Fusion significantly boosts exposure rates for cold-start items while maintaining exceptional stealthiness. This work establishes a novel paradigm for security evaluation of VARS, bridging multimodal representation learning and adversarial robustness in recommender systems.

📝 Abstract
Modern Visual-Aware Recommender Systems (VARS) exploit the integration of user interaction data and visual features to deliver personalized recommendations with high precision. However, their robustness against adversarial attacks remains largely underexplored, posing significant risks to system reliability and security. Existing attack strategies suffer from notable limitations: shilling attacks are costly and detectable, and visual-only perturbations often fail to align with user preferences. To address these challenges, we propose AUV-Fusion, a cross-modal adversarial attack framework that combines high-order user preference modeling with cross-modal adversary generation. Specifically, we obtain robust user embeddings through multi-hop user-item interactions and transform them via an MLP into semantically aligned perturbations. These perturbations are injected into the latent space of a pre-trained VAE within the diffusion model. By synergistically integrating genuine user interaction data with visually plausible perturbations, AUV-Fusion eliminates the need to inject fake user profiles and mitigates the insufficient user-preference extraction inherent in traditional visual-only attacks. Comprehensive evaluations on diverse VARS architectures and real-world datasets demonstrate that AUV-Fusion significantly enhances the exposure of target (cold-start) items compared to conventional baseline methods. Moreover, AUV-Fusion maintains exceptional stealth under rigorous scrutiny.
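The abstract's success criterion is increased exposure of a target cold-start item. A common way to operationalize this (the paper's exact metric is not given here) is exposure@K: the fraction of users whose top-K recommendation list contains the target item. The `exposure_at_k` helper, the toy score matrix, and the item index below are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Hypothetical recommendation scores: 3 users x 6 items (higher = ranked higher).
scores = np.array([
    [0.9, 0.1, 0.4, 0.8, 0.2, 0.3],
    [0.2, 0.7, 0.2, 0.1, 0.9, 0.3],
    [0.5, 0.4, 0.8, 0.2, 0.1, 0.9],
])
target_item = 2  # index of the cold-start item an attacker wants to promote

def exposure_at_k(scores, item, k):
    """Fraction of users whose top-k recommendation list contains `item`."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of each user's top-k items
    return float(np.mean(np.any(topk == item, axis=1)))

print(exposure_at_k(scores, target_item, k=3))  # 2 of 3 users see item 2 in their top-3
```

An attack like AUV-Fusion is then judged by how much this value rises after the item's image is adversarially perturbed, while the perturbation itself stays visually inconspicuous.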
Problem

Research questions and friction points this paper is trying to address.

Enhancing adversarial attacks on Visual-Aware Recommender Systems
Overcoming limitations of shilling and visual-only attacks
Synergizing user interactions and visual perturbations effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-order user preference modeling for robust embeddings
Cross-modal adversary generation via MLP transformation
Diffusion model VAE latent space perturbation injection
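The three innovation bullets above describe a pipeline: propagate user embeddings over multi-hop user-item paths, map the resulting preference vector through an MLP into a perturbation, and add that perturbation to an item image's VAE latent code. The sketch below is a minimal numpy mock-up of that data flow under assumed shapes; the matrices, the `row_normalize`/`mlp_perturbation` helpers, the two-hop scheme, and the norm bound are all illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: 4 users, 5 items, binary interaction matrix R.
R = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
], dtype=float)

d = 8                               # assumed embedding dimension
U = rng.normal(size=(4, d))         # initial user embeddings
V = rng.normal(size=(5, d))         # item embeddings

def row_normalize(M):
    s = M.sum(axis=1, keepdims=True)
    s[s == 0] = 1.0
    return M / s

# (1) High-order preference modeling: aggregate embeddings along
#     multi-hop user -> item -> co-user paths.
A_ui = row_normalize(R)             # user -> item averaging
A_iu = row_normalize(R.T)           # item -> user averaging
U_hop1 = A_ui @ V                   # 1-hop: items the user interacted with
U_hop2 = A_ui @ (A_iu @ U)          # 2-hop: users who share those items
U_robust = U + U_hop1 + U_hop2      # aggregated high-order user embedding

# (2) An MLP maps the user embedding to a perturbation living in the
#     latent space of a (hypothetical) pre-trained VAE.
latent_dim = 16
W1 = rng.normal(scale=0.1, size=(d, 32))
W2 = rng.normal(scale=0.1, size=(32, latent_dim))

def mlp_perturbation(u, eps=0.05):
    h = np.tanh(u @ W1)
    delta = h @ W2
    # Norm-bound the perturbation so the decoded image stays plausible.
    return eps * delta / (np.linalg.norm(delta) + 1e-8)

# (3) Inject the perturbation into the target item's latent code; the
#     diffusion model's decoder would then render the adversarial image.
z_item = rng.normal(size=(latent_dim,))   # latent code of the item image
delta = mlp_perturbation(U_robust[0])     # perturbation from user 0's preferences
z_adv = z_item + delta

print(float(np.linalg.norm(z_adv - z_item)))  # stays within the eps budget
```

In the real system, step (3) would be followed by decoding `z_adv` with the VAE/diffusion decoder, and the MLP would be trained so the perturbed image raises the target item's ranking for users matching the modeled preferences.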
👥 Authors
Hai Ling (Communication University of China)
Tianchi Wang (Communication University of China)
Xiaohao Liu (National University of Singapore; Multimodal Learning, Information Retrieval)
Zhulin Tao (Communication University of China)
Lifang Yang (Communication University of China)
Xianglin Huang (Communication University of China)