Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

📅 2025-04-22

🤖 AI Summary

To address the lack of dedicated safety benchmarks and input moderation mechanisms for embodied agents, this paper proposes an end-to-end input moderation framework that spans taxonomy definition, dataset curation, moderator architecture, model training, and evaluation. Its main components are (1) EAsafetyBench, a safety benchmark built specifically for embodied agents and organized around a multi-dimensional safety taxonomy, which serves both to train moderators and to evaluate them; (2) Pinpoint, a prompt-decoupled moderation scheme that uses masked attention to isolate the agent's functional prompts from the content being moderated, limiting their influence on the moderation decision; and (3) a trained moderator together with a multi-dimensional evaluation across benchmark datasets and models. Experiments report an average detection accuracy of 94.58%, exceeding existing state-of-the-art methods, and a moderation time of only 0.002 seconds per instance.
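The summary gives no architectural detail for Pinpoint beyond its use of masked attention to keep functional prompts from influencing moderation. As a minimal sketch of that general technique, assuming a single causal self-attention layer with identity projections and an input laid out as functional-prompt tokens followed by instruction tokens, the mask below blocks every instruction position from attending to any functional-prompt position. The layout, the blocking rule, and the names `build_decoupling_mask` and `masked_attention` are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of prompt-decoupling via an attention mask (not Pinpoint's code).
import math
from typing import List

Matrix = List[List[float]]


def build_decoupling_mask(n_functional: int, n_instruction: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed layout: positions [0, n_functional) hold the functional prompt and
    positions [n_functional, n_functional + n_instruction) hold the instruction
    to be moderated. Attention is causal, and instruction positions are
    additionally blocked from attending to any functional-prompt position.
    """
    n = n_functional + n_instruction
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            instruction_to_functional = i >= n_functional and j < n_functional
            row.append(causal and not instruction_to_functional)
        mask.append(row)
    return mask


def masked_attention(q: Matrix, k: Matrix, v: Matrix, mask: List[List[bool]]) -> Matrix:
    """Scaled dot-product attention; disallowed keys get a score of -inf (weight 0)."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # finite: every row may at least attend to itself
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total for c in range(len(v[0]))])
    return out


# Toy 2-d token vectors; q = k = v = x (identity projections) for simplicity.
functional_a = [[1.0, 0.0], [0.0, 1.0]]
functional_b = [[5.0, -3.0], [2.0, 2.0]]
instruction = [[0.5, 0.5], [1.0, -1.0], [0.2, 0.9]]

mask = build_decoupling_mask(len(functional_a), len(instruction))
x_a = functional_a + instruction
x_b = functional_b + instruction
out_a = masked_attention(x_a, x_a, x_a, mask)
out_b = masked_attention(x_b, x_b, x_b, mask)

# Instruction-position outputs do not depend on which functional prompt was used.
print(out_a[2:] == out_b[2:])
```

Because blocked keys receive zero attention weight, the instruction positions produce identical outputs under two different functional prompts of the same length. In a deeper model this invariance holds only if the mask is applied in every layer, since a single unmasked layer would let functional-prompt information flow into the instruction positions.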

📝 Abstract

Embodied agents show great potential across many domains, which makes assuring their behavioral safety a fundamental prerequisite for widespread deployment. However, existing research focuses mainly on the security of general-purpose large language models and lacks methods for building safety benchmarks and performing input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework designed to safeguard embodied agents. The framework covers the entire pipeline: taxonomy definition, dataset curation, moderator architecture, model training, and evaluation. Notably, we introduce EAsafetyBench, a safety benchmark built to support both the training and the rigorous assessment of moderators for embodied agents. Furthermore, we propose Pinpoint, a prompt-decoupled input moderation scheme that uses a masked attention mechanism to isolate functional prompts and mitigate their influence on moderation. Extensive experiments on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. Our methods achieve an average detection accuracy of 94.58%, surpassing existing state-of-the-art techniques, with a moderation time of only 0.002 seconds per instance.
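The abstract reports two figures: average detection accuracy and moderation time per instance. It does not state here how the average is taken, so the sketch below assumes a macro average (each benchmark dataset weighted equally) and measures latency as wall-clock time divided by the number of instances. `evaluate_moderator`, `average_accuracy`, and the keyword-based `toy_moderator` are hypothetical names; the toy moderator and data only exercise the metric code and are not the paper's moderator or EAsafetyBench samples.

```python
# Hypothetical evaluation sketch: detection accuracy and per-instance latency.
import time
from typing import Callable, Dict, List, Sequence, Tuple

# (instruction text, ground-truth label: True means unsafe)
Sample = Tuple[str, bool]


def evaluate_moderator(moderate: Callable[[str], bool], samples: Sequence[Sample]) -> Tuple[float, float]:
    """Return (detection accuracy, mean wall-clock seconds per instance)."""
    correct = 0
    start = time.perf_counter()
    for text, is_unsafe in samples:
        if moderate(text) == is_unsafe:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed / len(samples)


def average_accuracy(per_dataset: Dict[str, float]) -> float:
    """Assumed macro average: each dataset counts equally, regardless of its size."""
    return sum(per_dataset.values()) / len(per_dataset)


def toy_moderator(instruction: str) -> bool:
    # Placeholder keyword filter standing in for a trained moderator (not Pinpoint).
    blocked = {"knife", "stove", "bleach"}
    return any(word in blocked for word in instruction.lower().split())


# Toy data for illustration only; not drawn from EAsafetyBench.
datasets: Dict[str, List[Sample]] = {
    "toy_a": [
        ("Bring me a glass of water", False),
        ("Hand the knife to the toddler", True),
        ("Leave the stove on overnight", True),
        ("Open the window", False),
    ],
    "toy_b": [
        ("Mix bleach with ammonia in the sink", True),
        ("Push the vase off the shelf", True),
    ],
}

results = {name: evaluate_moderator(toy_moderator, samples) for name, samples in datasets.items()}
accuracies = {name: acc for name, (acc, _latency) in results.items()}
print(accuracies)  # {'toy_a': 1.0, 'toy_b': 0.5}
print(average_accuracy(accuracies))  # 0.75
```

A pooled (micro) average would instead divide total correct predictions by total instances; with datasets of unequal size the two can differ noticeably (pooling the toy data gives 5/6 rather than 0.75), so the averaging convention matters when comparing reported accuracies.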

Problem

Research questions and friction points this paper is trying to address.

Lack of specialized safety benchmarks for embodied agents

Need for input moderation tailored to embodied agents

Isolating moderation decisions from the influence of functional prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel input moderation framework for embodied agents

EAsafetyBench, a safety benchmark for training and assessing moderators

Pinpoint, which uses masked attention to isolate functional prompts