🤖 AI Summary
This work exposes a novel security threat to embodied AI systems (e.g., robotic vehicles) that rely on multimodal language understanding: adversaries can embed deceptive natural-language commands in visual inputs (e.g., traffic signs), exploiting the semantic reasoning capabilities of large vision-language models (LVLMs) to perform "command hijacking" and induce erroneous agent behavior. To this end, the authors propose the first prompt-based command-hijacking attack paradigm, which constructs a visual attack prompt dictionary and combines multimodal prompt engineering, token-space search, and semantic perturbation strategies to generate highly stealthy adversarial inputs. The approach is evaluated on four LVLM-based agents, covering drone landing, autonomous driving, and aerial target tracking, as well as on real robotic platforms. Results show significantly higher attack success rates than state-of-the-art methods, systematically revealing for the first time a semantic-level (rather than pixel-level) vulnerability in embodied AI.
📝 Abstract
Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce, using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a new class of prompt-based attacks that exploit the multimodal language-interpretation abilities of Large Visual-Language Models (LVLMs). CHAI embeds deceptive natural-language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM agents (drone emergency landing, autonomous driving, and aerial object tracking) and on a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.