🤖 AI Summary
This work exposes a novel security threat to embodied AI systems (e.g., robotic vehicles) that rely on multimodal language understanding: adversaries can embed deceptive natural-language commands in visual inputs (e.g., traffic signs), exploiting the semantic reasoning capabilities of large vision-language models (LVLMs) to perform "command hijacking" and induce erroneous agent behavior. To this end, the authors propose the first prompt-based command-hijacking attack paradigm, which constructs a visual attack prompt dictionary and combines multimodal prompt engineering, token-space search, and semantic perturbation strategies to generate highly stealthy adversarial inputs. The approach is evaluated on four LVLM-based agents, covering drone landing, autonomous driving, and aerial target tracking, as well as on real robotic platforms. Results show significantly higher attack success rates than state-of-the-art methods, systematically revealing for the first time a semantic-level (rather than pixel-level) vulnerability in embodied AI.
📝 Abstract
Embodied Artificial Intelligence (AI) promises to handle edge cases in robotic vehicle systems where data is scarce, using common-sense reasoning grounded in perception and action to generalize beyond training distributions and adapt to novel real-world situations. These capabilities, however, also create new security risks. In this paper, we introduce CHAI (Command Hijacking against embodied AI), a new class of prompt-based attacks that exploit the multimodal language-interpretation abilities of Large Visual-Language Models (LVLMs). CHAI embeds deceptive natural-language instructions, such as misleading signs, in visual input, systematically searches the token space, builds a dictionary of prompts, and guides an attacker model to generate Visual Attack Prompts. We evaluate CHAI on four LVLM agents (drone emergency landing, autonomous driving, and aerial object tracking) and on a real robotic vehicle. Our experiments show that CHAI consistently outperforms state-of-the-art attacks. By exploiting the semantic and multimodal reasoning strengths of next-generation embodied AI systems, CHAI underscores the urgent need for defenses that extend beyond traditional adversarial robustness.