Jailbreaking Embodied LLMs via Action-level Manipulation

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap between language-level safety mechanisms and real-world physical consequences in embodied large language models (LLMs), demonstrating that semantically innocuous instructions can trigger hazardous physical actions. To exploit this vulnerability, the authors propose Blindfold, an action-level jailbreaking framework built on adversarial proxy planning. Blindfold compromises a local proxy model to generate behavior sequences that appear safe but are covertly harmful, injects subtle action perturbations to evade detection, and integrates a rule-based verifier to improve execution feasibility. Extensive experiments on both simulated and real 6DoF robotic arms show that Blindfold achieves up to a 53% higher attack success rate than state-of-the-art baselines, underscoring the insufficiency of purely linguistic safeguards and the urgent need for consequence-aware safety mechanisms in embodied AI systems.
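
For intuition, the three-stage pipeline described in the summary can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `surrogate_plan`, `perturb_actions`, `verify`, `blindfold_attack`, and all parameters are assumed names invented for this example.

```python
import random
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, List[float]]  # (action name, numeric parameters)

def perturb_actions(plan: List[Action], noise_scale: float = 0.02) -> List[Action]:
    """Inject small Gaussian noise into each action's parameters so the
    key steps no longer match simple signature-based safety filters."""
    return [(name, [p + random.gauss(0.0, noise_scale) for p in params])
            for name, params in plan]

def verify(plan: List[Action], rules: List[Callable[[Action], bool]]) -> bool:
    """Rule-based verifier: accept the plan only if every action
    satisfies every feasibility rule."""
    return all(rule(step) for step in plan for rule in rules)

def blindfold_attack(surrogate_plan: Callable[[str], List[Action]],
                     benign_instruction: str,
                     rules: List[Callable[[Action], bool]],
                     max_tries: int = 10) -> Optional[List[Action]]:
    """End-to-end loop: plan on the local surrogate, perturb the
    actions, and retry until the verifier accepts the sequence."""
    for _ in range(max_tries):
        plan = perturb_actions(surrogate_plan(benign_instruction))
        if verify(plan, rules):
            return plan
    return None

# Toy usage with a stub surrogate planner and one feasibility rule.
stub = lambda _: [("move_to", [0.1, 0.2, 0.3]), ("grasp", [0.5])]
rule = lambda step: all(-1.0 <= p <= 1.0 for p in step[1])
print(blindfold_attack(stub, "tidy the table", [rule]))
```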

📝 Abstract
Embodied Large Language Models (LLMs) enable AI agents to interact with the physical world through natural language instructions and actions. However, beyond the language-level risks inherent to LLMs themselves, embodied LLMs with real-world actuation introduce a new vulnerability: instructions that appear semantically benign may still lead to dangerous real-world consequences, revealing a fundamental misalignment between linguistic security and physical outcomes. In this paper, we introduce Blindfold, an automated attack framework that leverages the limited causal reasoning capabilities of embodied LLMs in real-world action contexts. Rather than iterative trial-and-error jailbreaking of black-box embodied LLMs, Blindfold adopts an Adversarial Proxy Planning strategy: it compromises a local surrogate LLM to perform action-level manipulations that appear semantically safe but could result in harmful physical effects when executed. Blindfold further conceals key malicious actions by injecting carefully crafted noise to evade detection by defense mechanisms, and it incorporates a rule-based verifier to improve attack executability. Evaluations on both embodied AI simulators and a real-world 6DoF robotic arm show that Blindfold achieves up to 53% higher attack success rates than SOTA baselines, highlighting the urgent need to move beyond surface-level language censorship and toward consequence-aware defense mechanisms to secure embodied LLMs.
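
The abstract describes the rule-based verifier only at a high level; below is a minimal sketch of what such a feasibility check might look like for a 6DoF arm. The joint limits, the `MAX_STEP` smoothness rule, and all names are assumptions for illustration, not details from the paper.

```python
import math
from typing import List, Tuple

Action = Tuple[str, List[float]]  # (action name, 6 joint targets in radians)

# Assumed bounds for a generic 6DoF arm; the abstract does not name
# the arm model, so these values are purely illustrative.
JOINT_LIMITS = [(-math.pi, math.pi)] * 6
MAX_STEP = 0.3  # assumed cap on per-joint change between waypoints

def verify_plan(plan: List[Action]) -> bool:
    """Reject a plan if any waypoint leaves the joint limits or jumps
    more than MAX_STEP on any joint between consecutive actions."""
    prev = None
    for _, joints in plan:
        if len(joints) != len(JOINT_LIMITS):
            return False
        if any(not (lo <= q <= hi) for q, (lo, hi) in zip(joints, JOINT_LIMITS)):
            return False
        if prev is not None and any(abs(b - a) > MAX_STEP
                                    for a, b in zip(prev, joints)):
            return False
        prev = joints
    return True

# Example: a two-waypoint plan that stays in limits and moves smoothly.
print(verify_plan([("move_to", [0.0] * 6), ("move_to", [0.1] * 6)]))  # True
```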
Problem

Research questions and friction points this paper is trying to address.

embodied LLMs
jailbreaking
action-level manipulation
physical safety
adversarial attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied LLMs
action-level manipulation
adversarial proxy planning
jailbreaking
consequence-aware security