🤖 AI Summary
This paper identifies and systematically investigates *indirect environmental jailbreaking* (IEJ) in embodied AI—a novel threat where adversaries bypass safety mechanisms not by directly injecting malicious prompts into the agent, but by embedding adversarial instructions (e.g., text written on walls) in the physical or simulated environment, exploiting the agent's blind trust in its environmental perception. The authors formalize IEJ and propose SHAWSHANK, an automated attack-generation framework integrating multimodal prompt injection, vision-language model (VLM) security analysis, and black-box environmental perception modeling, along with SHAWSHANK-FORGE, a benchmark-construction framework used to build SHAWSHANK-BENCH, the first IEJ evaluation benchmark. Evaluated across 3,957 task-scene combinations, the attack successfully jailbreaks all six tested mainstream VLMs and significantly outperforms 11 baseline methods, while existing defenses provide only limited mitigation against IEJ.
📝 Abstract
The adoption of Vision-Language Models (VLMs) in embodied AI agents, while effective, raises safety concerns such as jailbreaking. Prior work has explored directly jailbreaking embodied agents through carefully crafted multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack that jailbreaks embodied AI via indirect prompts injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not "think twice" about instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed IEJ attack; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Using SHAWSHANK-FORGE, we then automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where such instructions should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.