HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Tasks

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses safety risks in home environments, where seemingly innocuous instructions can trigger hazardous outcomes due to subtle contextual states, a challenge that existing rule-based or prompt-engineering approaches struggle to resolve without sacrificing accuracy or generalization. To this end, we propose HomeGuard, an architecture-agnostic safety mechanism that integrates Context-Guided Chain-of-Thought (CG-CoT) reasoning with Reinforcement Fine-Tuning (RFT). By actively perceiving the interaction target and its spatial neighborhood, HomeGuard enables precise intermediate visual anchoring and robust semantic safety judgments. Using a newly curated visual grounding dataset and a two-stage training strategy, our method significantly improves risk-match accuracy, surpassing baseline models by over 30%, while reducing both false positives and false negatives. Furthermore, it provides actionable spatial constraints and explicit obstacle-avoidance anchors for downstream trajectory planning.

📝 Abstract
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks, where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate: rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception, which sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy that uses Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model, HomeGuard, significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safe trajectory generation. Code and data are released at https://github.com/AI45Lab/HomeGuard
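The two-step CG-CoT decomposition in the abstract (active perception that grounds the interaction target and its spatial neighborhood, then a semantic judgment that turns hazards into planner constraints) can be sketched as a minimal pipeline. This is an illustrative assumption of the control flow only: the function names, the toy scene dictionary, and the hazard set are invented here and are not the released HomeGuard API.

```python
from dataclasses import dataclass

@dataclass
class Anchor:
    """A grounded object with a 2D bounding box (x, y, w, h)."""
    label: str
    box: tuple

def ground_context(scene: dict, target: str, radius: float = 1.0) -> list:
    """Step 1, active perception: anchor the interaction target and the
    objects in its spatial neighborhood (a stand-in for VLM grounding)."""
    tx, ty = scene[target]["pos"]
    anchors = []
    for label, obj in scene.items():
        ox, oy = obj["pos"]
        # Keep the target itself plus anything within a Manhattan radius.
        if label == target or abs(ox - tx) + abs(oy - ty) <= radius:
            anchors.append(Anchor(label, obj["box"]))
    return anchors

def judge_safety(instruction: str, anchors: list, hazards: set) -> dict:
    """Step 2, semantic judgment: flag hazards among the grounded anchors
    and expose their boxes as obstacle-avoidance constraints."""
    risky = [a for a in anchors if a.label in hazards]
    return {
        "instruction": instruction,
        "safe": not risky,
        "constraints": [a.box for a in risky],  # fed to a downstream planner
    }

# Toy scene: a towel lies next to the stove, a sink is far away.
scene = {
    "stove": {"pos": (0.0, 0.0), "box": (10, 10, 40, 40)},
    "towel": {"pos": (0.5, 0.2), "box": (30, 5, 20, 20)},
    "sink":  {"pos": (5.0, 0.0), "box": (200, 10, 50, 40)},
}
anchors = ground_context(scene, target="stove")
verdict = judge_safety("turn on the stove", anchors, hazards={"towel"})
```

In this sketch the benign command "turn on the stove" is flagged as unsafe only because the grounding step surfaced the nearby towel; the distant sink is never considered, which mirrors the paper's point that focused perception reduces both missed risks and oversafety.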
Problem

Research questions and friction points this paper is trying to address.

contextual safety risk
embodied agents
household tasks
vision-language models
risk identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Guided Chain-of-Thought
Vision-Language Models
Embodied Safety
Reinforcement Fine-Tuning
Visual Grounding
Authors
Xiaoya Lu, Shanghai AI Laboratory
Yijin Zhou, Shanghai AI Laboratory
Zeren Chen, Beihang University
Ruocheng Wang, Shanghai Jiao Tong University
Bingrui Sima, Huazhong University of Science and Technology
Enshen Zhou, Beihang University (Embodied AI, Embodied Agent, Robot Learning, Generative Model)
Lu Sheng, School of Software, Beihang University (Embodied AI, 3D Vision, Machine Learning)
Dongrui Liu, Shanghai AI Laboratory
Jing Shao, Research Scientist, Shanghai AI Laboratory/Shanghai Jiao Tong University (Computer Vision, Multi-Modal Large Language Model)