🤖 AI Summary
This work addresses a critical gap in the evaluation of embodied intelligent systems, which has predominantly focused on task success rates while neglecting governance capabilities such as policy compliance, safety recovery, and responsiveness to human intervention. To remedy this, the study introduces governance capability as a first-class evaluation objective and proposes a comprehensive benchmark encompassing seven dimensions, including controllability, recoverability, and evolution safety. The framework supports standardized testing in both single-agent and multi-agent environments through a modular runtime architecture, contract-aware upgrade mechanisms, scenario templates, and perturbation operators. By providing a complete suite of metrics and baseline evaluation protocols, this benchmark establishes a quantitative foundation for the safe and controllable advancement of embodied intelligent systems.
📝 Abstract
Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.
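The abstract names seven governance dimensions and a benchmark structure built from scenario templates, perturbation operators, and governance metrics. The paper does not publish an API, so the following is only an illustrative sketch of how such a structure could be represented; all class and function names (`GovernanceDimension`, `Scenario`, `governance_score`) are hypothetical, not part of EmbodiedGovBench.

```python
# Hypothetical sketch of the benchmark structure described in the abstract.
# None of these names come from EmbodiedGovBench itself.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class GovernanceDimension(Enum):
    # The seven dimensions enumerated in the abstract.
    UNAUTHORIZED_INVOCATION = "unauthorized capability invocation"
    DRIFT_ROBUSTNESS = "runtime drift robustness"
    RECOVERY_SUCCESS = "recovery success"
    POLICY_PORTABILITY = "policy portability"
    UPGRADE_SAFETY = "version upgrade safety"
    OVERRIDE_RESPONSIVENESS = "human override responsiveness"
    AUDIT_COMPLETENESS = "audit completeness"

@dataclass
class Scenario:
    """A scenario template: a base task plus perturbation operators."""
    name: str
    dimension: GovernanceDimension
    fleet_size: int = 1  # 1 = single-robot setting, >1 = fleet setting
    perturbations: List[str] = field(default_factory=list)

def governance_score(
    results: Dict[GovernanceDimension, List[float]]
) -> Dict[str, float]:
    """Aggregate per-episode pass rates (0.0-1.0) into per-dimension scores."""
    return {dim.value: sum(v) / len(v) for dim, v in results.items() if v}

# Usage: one scenario probing human override responsiveness.
scenario = Scenario(
    name="mid-task e-stop",
    dimension=GovernanceDimension.OVERRIDE_RESPONSIVENESS,
    perturbations=["inject human override mid-execution"],
)
scores = governance_score({scenario.dimension: [1.0, 1.0, 0.0]})
```

Representing each dimension as an enum value and each test case as a template-plus-perturbations pair mirrors the abstract's separation between scenario templates and the perturbation operators applied to them.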