When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

📅 2025-11-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes a critical vulnerability of Vision-Language-Action (VLA) models under multimodal adversarial attacks: prior studies focus on single-modality perturbations and overlook how cross-modal misalignment disrupts embodied reasoning and decision-making. The authors propose VLA-Fool, a unified framework for studying multimodal adversarial robustness under both white-box and black-box settings. It jointly covers textual perturbations, visual patch and noise distortions, and cross-modal misalignment attacks, and couples gradient-based optimization with a VLA-aware semantic space to build an automatically crafted, semantically guided prompting framework. Experiments on the LIBERO benchmark with a fine-tuned OpenVLA model show that even minor multimodal perturbations cause significant behavioral deviations, revealing systematic weaknesses in embodied multimodal alignment and providing a basis for security evaluation of embodied AI systems.
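The exact attack objective is not reproduced on this page, so the following is only a minimal sketch of what coupling gradient-based optimization with a semantic-space term could look like when the gradient step is applied to the visual input. `VLAStub`, the dimensions, the loss weight `lam`, and the PGD schedule are illustrative assumptions, not OpenVLA or the authors' implementation.

```python
# Minimal illustrative sketch (not the authors' code): a white-box PGD attack on the
# input image whose objective couples (a) deviation of the predicted action from the
# clean action with (b) a semantic-space term that degrades image-instruction alignment.
# VLAStub and all hyperparameters are assumptions standing in for a real VLA policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAStub(nn.Module):
    """Toy stand-in VLA: maps an image and an instruction embedding to a 7-DoF action."""
    def __init__(self, img_dim=3 * 32 * 32, txt_dim=64, act_dim=7):
        super().__init__()
        self.vision = nn.Linear(img_dim, 128)   # stand-in vision encoder
        self.text = nn.Linear(txt_dim, 128)     # stand-in language encoder
        self.head = nn.Linear(256, act_dim)     # stand-in action head

    def embed_image(self, img):
        return self.vision(img.flatten(1))

    def forward(self, img, txt_emb):
        fused = torch.cat([self.embed_image(img), self.text(txt_emb)], dim=-1)
        return self.head(fused)

def joint_attack(model, img, txt_emb, eps=8 / 255, alpha=2 / 255, steps=10, lam=0.5):
    """PGD on the image: maximise action deviation, minimise image-text agreement."""
    with torch.no_grad():
        clean_action = model(img, txt_emb)
        txt_feat = F.normalize(model.text(txt_emb), dim=-1)

    delta = torch.zeros_like(img, requires_grad=True)
    for _ in range(steps):
        adv_img = img + delta
        action_dev = F.mse_loss(model(adv_img, txt_emb), clean_action)
        img_feat = F.normalize(model.embed_image(adv_img), dim=-1)
        alignment = (img_feat * txt_feat).sum(dim=-1).mean()
        loss = action_dev - lam * alignment      # ascend: large deviation, low alignment
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()               # gradient-ascent step
            delta.clamp_(-eps, eps)                          # L-infinity budget
            delta.copy_((img + delta).clamp(0, 1) - img)     # keep pixels valid
            delta.grad = None
    return (img + delta).detach()

if __name__ == "__main__":
    model = VLAStub()
    img = torch.rand(1, 3, 32, 32)     # placeholder observation
    txt_emb = torch.randn(1, 64)       # placeholder instruction embedding
    adv = joint_attack(model, img, txt_emb)
    shift = (model(adv, txt_emb) - model(img, txt_emb)).norm().item()
    print(f"action shift under attack: {shift:.4f}")
```

In a real evaluation the stub would be replaced by the actual policy, and the misalignment term by the model's own vision-language similarity.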

📝 Abstract
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
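The abstract mentions an automatically crafted, semantically guided prompting framework but does not spell out its construction here. As a rough illustration of the idea, the sketch below performs greedy word substitution under a semantic-similarity constraint and keeps the substitution that most shifts the policy's predicted action; the pseudo-embeddings, synonym table, and `policy_action` stub are hypothetical stand-ins rather than the paper's method.

```python
# Minimal illustrative sketch (assumptions throughout, not the paper's framework):
# a greedy, semantically guided prompt attack. Each word of the instruction is
# tested against synonym candidates; candidates whose sentence embedding drifts too
# far from the original are rejected (semantic-space constraint), and the surviving
# substitution that most shifts the policy's predicted action is kept.
import zlib
import numpy as np

WORD_DIM = 32

def _word_vec(word: str) -> np.ndarray:
    """Deterministic pseudo-embedding so the sketch runs without a real text encoder."""
    return np.random.default_rng(zlib.crc32(word.encode())).standard_normal(WORD_DIM)

def embed_sentence(words: list[str]) -> np.ndarray:
    vec = np.mean([_word_vec(w) for w in words], axis=0)
    return vec / (np.linalg.norm(vec) + 1e-8)

def policy_action(words: list[str]) -> np.ndarray:
    """Stand-in VLA policy conditioned on the instruction (the image is held fixed)."""
    return np.tanh(embed_sentence(words)[:7])   # pretend 7-DoF action

def semantic_prompt_attack(instruction: str, synonyms: dict[str, list[str]],
                           sim_threshold: float = 0.85) -> str:
    words = instruction.split()
    clean_emb = embed_sentence(words)
    clean_act = policy_action(words)
    best_words, best_shift = list(words), 0.0
    for i, word in enumerate(words):
        for cand in synonyms.get(word, []):
            trial = list(words)
            trial[i] = cand
            # Semantic-space constraint: the perturbed prompt must stay close in meaning
            # (with a real sentence encoder this keeps only paraphrase-level edits).
            if float(embed_sentence(trial) @ clean_emb) < sim_threshold:
                continue
            shift = float(np.linalg.norm(policy_action(trial) - clean_act))
            if shift > best_shift:
                best_words, best_shift = trial, shift
    return " ".join(best_words)

if __name__ == "__main__":
    synonyms = {"pick": ["grab", "lift"], "mug": ["cup"], "place": ["put", "set"]}
    adv = semantic_prompt_attack("pick up the mug and place it on the plate", synonyms)
    print("adversarial instruction:", adv)
```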
Problem

Research questions and friction points this paper is trying to address.

Investigating adversarial robustness of vision-language-action models in embodied environments
Addressing cross-modal misalignment attacks that disrupt perception-instruction correspondence
Developing unified multimodal attacks under both white-box and black-box settings
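The black-box setting mentioned in the last point is not detailed on this page; one plausible shape of such an attack is query-only random search on an image perturbation, as in the hedged sketch below, where `query_policy` and all hyperparameters are assumptions rather than the authors' procedure.

```python
# Minimal illustrative sketch (assumption, not the paper's method): a query-based
# black-box attack that only observes the policy's action output. Random-search
# candidates within an L-infinity budget are kept whenever they increase the
# deviation from the clean action.
import numpy as np

def query_policy(image: np.ndarray) -> np.ndarray:
    """Stand-in black-box VLA: image in, 7-DoF action out (instruction held fixed)."""
    w = np.random.default_rng(0).standard_normal((image.size, 7))
    return np.tanh(image.reshape(-1) @ w / image.size)

def black_box_attack(image, eps=8 / 255, step=2 / 255, queries=200, seed=1):
    rng = np.random.default_rng(seed)
    clean = query_policy(image)
    delta = np.zeros_like(image)
    best_dev = 0.0
    for _ in range(queries):
        # Propose a signed random step and project back into the perturbation budget.
        cand = np.clip(delta + step * rng.choice([-1.0, 1.0], size=image.shape), -eps, eps)
        dev = np.linalg.norm(query_policy(np.clip(image + cand, 0.0, 1.0)) - clean)
        if dev > best_dev:
            best_dev, delta = dev, cand
    return np.clip(image + delta, 0.0, 1.0), best_dev

if __name__ == "__main__":
    img = np.random.default_rng(2).random((3, 32, 32)).astype(np.float32)
    adv, dev = black_box_attack(img)
    print(f"action deviation after 200 queries: {dev:.4f}")
```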
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal adversarial attacks on vision-language-action models
Cross-modal misalignment attacks disrupting semantic correspondence
Automatically crafted semantically guided prompting framework
Authors

Yuping Yan (TGAI Lab, School of Engineering, Westlake University)
Yuhan Xie (PhD student, EPFL; research interests: Deep Learning, Time Series Analysis, Brain-Computer Interface)
Yinxin Zhang (Pennsylvania State University)
Lingjuan Lyu (Sony; research interests: Foundation Models, Federated Learning, Responsible AI)
Yaochu Jin (TGAI Lab, School of Engineering, Westlake University)