🤖 AI Summary
Existing safety evaluation methods struggle to detect dynamic unsafe behaviors of embodied agents in home environments and lack fine-grained, multidimensional benchmarks grounded in real-world scenarios. To address this gap, this work proposes HomeSafe-Bench, the first dynamic safety evaluation benchmark, spanning six household zones and 438 diverse cases, and introduces HD-Guard, a hierarchical streaming architecture that pairs a lightweight FastBrain with a large-model-based SlowBrain to enable low-latency, high-accuracy multimodal real-time monitoring. Experiments on a dataset synthesized through physical simulation and advanced video generation not only demonstrate HD-Guard's superior trade-off between latency and performance but also uncover critical limitations of current vision-language models in household safety tasks, charting new directions for future research.
📝 Abstract
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and a lack of common-sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce \textbf{HomeSafe-Bench}, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose \textbf{Hierarchical Dual-Brain Guard for Household Safety (HD-Guard)}, a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
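The FastBrain/SlowBrain coordination described above can be sketched as a streaming loop with asynchronous escalation. This is a minimal illustration, not the authors' implementation: the `fast_brain` and `slow_brain` functions, the motion-based risk score, and the threshold are all hypothetical stand-ins, since the paper's actual model interfaces are not specified in the abstract.

```python
import queue
import threading

def fast_brain(frame):
    """Lightweight screener: cheap per-frame risk score.
    (Stand-in heuristic: motion magnitude as a risk proxy.)"""
    return frame.get("motion", 0.0)

def slow_brain(frame):
    """Large-model reasoner (stand-in): expensive multimodal verdict."""
    return "unsafe" if frame.get("hazard") else "safe"

def monitor(frames, threshold=0.5):
    """Stream frames through FastBrain at high frequency; escalate
    suspicious frames to SlowBrain on a background thread so that
    screening is never blocked by deep reasoning."""
    escalation_q = queue.Queue()
    verdicts = []

    def slow_worker():
        while True:
            frame = escalation_q.get()
            if frame is None:      # sentinel: end of stream
                break
            verdicts.append((frame["id"], slow_brain(frame)))

    worker = threading.Thread(target=slow_worker)
    worker.start()
    for frame in frames:
        # Continuous high-frequency screening; only risky frames
        # are handed to the asynchronous SlowBrain.
        if fast_brain(frame) >= threshold:
            escalation_q.put(frame)
    escalation_q.put(None)
    worker.join()
    return verdicts

frames = [
    {"id": 0, "motion": 0.1},
    {"id": 1, "motion": 0.9, "hazard": True},
    {"id": 2, "motion": 0.7},
]
print(monitor(frames))  # → [(1, 'unsafe'), (2, 'safe')]
```

The key design point mirrored here is that the cheap screener runs on every frame while the expensive reasoner only sees the escalated subset, which is how such a hierarchy can trade a small amount of accuracy for a large reduction in average latency.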