🤖 AI Summary
This work identifies and formalizes a novel adversarial vulnerability, termed "action freezing," in which adversarial images cause Vision-Language-Action (VLA) models to enter persistent response stagnation: the model ignores subsequent language instructions, leaving the robot inactive precisely when intervention is needed. To study this threat, we propose FreezeVLA, the first systematic adversarial attack framework tailored to VLA models, which uses min-max bi-level optimization to generate adversarial images with high attack success rates and strong cross-instruction transferability. Extensive experiments across three state-of-the-art VLA models and four robotic benchmarks demonstrate an average attack success rate of 76.2%, confirming a substantial risk of operational paralysis in real-world deployments of multimodal embodied AI systems. This work establishes both theoretical foundations and empirical evidence for the security evaluation and robustness enhancement of VLA models.
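One plausible reading of the min-max bi-level objective (our sketch from the summary above; the paper's exact loss and constraint set may differ): let $x$ be the clean image, $\delta$ an $\ell_\infty$-bounded perturbation with budget $\epsilon$, $\mathcal{L}$ a set of language prompts, $\pi_\theta$ the VLA policy, and $J_{\text{freeze}}$ a loss that is low when the predicted actions stagnate (e.g., cross-entropy against a no-op target). The attacker then solves

$$\delta^\star \;=\; \arg\min_{\|\delta\|_\infty \le \epsilon}\; \max_{\ell \in \mathcal{L}}\; J_{\text{freeze}}\big(\pi_\theta(x+\delta,\,\ell)\big),$$

so that the single image $x+\delta^\star$ induces freezing even under the worst-case prompt, which is what yields the attack's cross-instruction transferability.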
📝 Abstract
Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can "freeze" VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot's digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.
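To make the optimization concrete, below is a minimal PGD-style sketch of such a min-max loop in PyTorch. It is illustrative only, not the authors' released code: `vla` (a differentiable module mapping an image batch and a prompt string to action logits) and `noop_action` (the index of a "do nothing" action used as the freeze target) are hypothetical stand-ins, and the step size, budget, and iteration count are arbitrary.

```python
# Illustrative min-max bi-level attack sketch; NOT the FreezeVLA implementation.
# Assumptions (hypothetical): `vla(image, prompt)` returns action logits, and
# `noop_action` indexes a "do nothing" action whose selection models freezing.
import torch
import torch.nn.functional as F

def freeze_attack(vla, image, prompts, noop_action, eps=8/255, alpha=1/255, steps=200):
    """PGD over the image; the inner max picks the currently hardest prompt."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Inner maximization: evaluate the freeze loss under every prompt and
        # keep the worst case, so the outer update improves cross-prompt transfer.
        losses = []
        for prompt in prompts:
            logits = vla(image + delta, prompt)  # action logits for this prompt
            target = torch.full(logits.shape[:1], noop_action,
                                dtype=torch.long, device=logits.device)
            losses.append(F.cross_entropy(logits, target))
        worst = torch.stack(losses).max()
        # Outer minimization: push the image toward the no-op action under the
        # hardest prompt, projected onto an L-inf ball and the valid pixel range.
        grad, = torch.autograd.grad(worst, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((image + delta).clamp(0, 1) - image)
    return (image + delta).detach()
```

Taking only the single worst prompt per step is the simplest instantiation of the inner maximization; averaging over the top-k hardest prompts is a common smoother alternative with the same worst-case intent.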