HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Tasks

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses safety risks in home environments, where seemingly innocuous instructions can trigger hazardous outcomes due to subtle contextual states, a challenge that existing rule-based or prompt-engineering approaches struggle to resolve without sacrificing accuracy or generalization. To this end, we propose HomeGuard, an architecture-agnostic safety mechanism that integrates Context-Guided Chain-of-Thought (CG-CoT) reasoning with Reinforcement Fine-Tuning (RFT). By actively perceiving the interaction target and its spatial neighborhood, HomeGuard enables precise intermediate visual anchoring and robust semantic safety judgments. Using a newly curated visual grounding dataset and a two-stage training strategy, our method significantly improves risk-match accuracy, surpassing baseline models by over 30%, while reducing both false positives and false negatives. Furthermore, it provides actionable spatial constraints and explicit obstacle-avoidance anchors for downstream trajectory planning.

📝 Abstract
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks, where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate: rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception, which sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy that uses Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model, HomeGuard, significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safe trajectory generation. Code and data are released at https://github.com/AI45Lab/HomeGuard
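The two-step CG-CoT decomposition in the abstract (active perception that grounds the interaction target and its spatial neighborhood, then a semantic judgment that turns hazards into planner constraints) can be sketched as a minimal pipeline. This is an illustrative assumption of the control flow only: the function names, the toy scene dictionary, and the hazard set are invented here and are not the released HomeGuard API.

```python
from dataclasses import dataclass

@dataclass
class Anchor:
    """A grounded object with a 2D bounding box (x, y, w, h)."""
    label: str
    box: tuple

def ground_context(scene: dict, target: str, radius: float = 1.0) -> list:
    """Step 1, active perception: anchor the interaction target and the
    objects in its spatial neighborhood (a stand-in for VLM grounding)."""
    tx, ty = scene[target]["pos"]
    anchors = []
    for label, obj in scene.items():
        ox, oy = obj["pos"]
        # Keep the target itself plus anything within a Manhattan radius.
        if label == target or abs(ox - tx) + abs(oy - ty) <= radius:
            anchors.append(Anchor(label, obj["box"]))
    return anchors

def judge_safety(instruction: str, anchors: list, hazards: set) -> dict:
    """Step 2, semantic judgment: flag hazards among the grounded anchors
    and expose their boxes as obstacle-avoidance constraints."""
    risky = [a for a in anchors if a.label in hazards]
    return {
        "instruction": instruction,
        "safe": not risky,
        "constraints": [a.box for a in risky],  # fed to a downstream planner
    }

# Toy scene: a towel lies next to the stove, a sink is far away.
scene = {
    "stove": {"pos": (0.0, 0.0), "box": (10, 10, 40, 40)},
    "towel": {"pos": (0.5, 0.2), "box": (30, 5, 20, 20)},
    "sink":  {"pos": (5.0, 0.0), "box": (200, 10, 50, 40)},
}
anchors = ground_context(scene, target="stove")
verdict = judge_safety("turn on the stove", anchors, hazards={"towel"})
```

In this sketch the benign command "turn on the stove" is flagged as unsafe only because the grounding step surfaced the nearby towel; the distant sink is never considered, which mirrors the paper's point that focused perception reduces both missed risks and oversafety.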
Problem

Research questions and friction points this paper is trying to address.

contextual safety risk
embodied agents
household tasks
vision-language models
risk identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Guided Chain-of-Thought
Vision-Language Models
Embodied Safety
Reinforcement Fine-Tuning
Visual Grounding
Authors
Xiaoya Lu, Shanghai AI Laboratory
Yijin Zhou, Shanghai AI Laboratory
Zeren Chen, Beihang University
Ruocheng Wang, Shanghai Jiao Tong University
Bingrui Sima, Huazhong University of Science and Technology
Enshen Zhou, Beihang University (Embodied AI, Embodied Agent, Robot Learning, Generative Model)
Lu Sheng, School of Software, Beihang University (Embodied AI, 3D Vision, Machine Learning)
Dongrui Liu, Shanghai AI Laboratory
Jing Shao, Research Scientist, Shanghai AI Laboratory/Shanghai Jiao Tong University (Computer Vision, Multi-Modal Large Language Model)