Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

📅 2024-10-18

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Embodied agents cannot be deployed reliably in real-world environments unless they understand safety-critical scenes and respond to them appropriately. This paper proposes M-CoDAL, a multimodal safety dialogue system that uses discourse coherence relations to improve contextual understanding and communication. The system is trained with a clustering-based active learning mechanism in which an external large language model (LLM) identifies informative instances. It is evaluated on a new dataset of 1K safety violations extracted from 2K Reddit images, annotated by a large multimodal model (LMM) and verified by human annotators. In automated evaluation, M-CoDAL improves the resolution of safety situations, user sentiment, and the safety of the conversation. In a within-subject user study on a Hello Robot Stretch robot, its interventions were more persuasive than those of a baseline powered by OpenAI's ChatGPT.

📝 Abstract

When assisting people in daily tasks, robots need to accurately interpret visual cues and respond effectively in diverse safety-critical situations, such as sharp objects on the floor. In this context, we present M-CoDAL, a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. The system leverages discourse coherence relations to enhance its contextual understanding and communication abilities. To train this system, we introduce a novel clustering-based active learning mechanism that utilizes an external Large Language Model (LLM) to identify informative instances. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. These violations are annotated using a Large Multimodal Model (LMM) and verified by human annotators. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation. Next, we deploy our dialogue system on a Hello Robot Stretch robot and conduct a within-subject user study with real-world participants. In the study, participants role-play two safety scenarios with different levels of severity with the robot and receive interventions from our model and a baseline system powered by OpenAI's ChatGPT. The study results corroborate and extend the findings from the automated evaluation, showing that our proposed system is more persuasive in a real-world embodied agent setting.

Problem

Research questions and friction points this paper is trying to address.

Enhance robot safety communication

Active learning for safety-critical tasks

Multimodal dialogue system effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal-dialogue system for safety

Clustering-based active learning mechanism

External Large Language Model (LLM) that identifies informative training instances
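
The clustering-based active learning idea listed above can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything here is an assumption made for illustration: the toy 2-D embeddings, the tiny k-means routine, the rule of picking the top-scored instance per cluster, and the `informativeness` callable, which is a hypothetical stand-in for a score obtained from an external LLM. It is not the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[int]:
    """Tiny deterministic k-means: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def select_for_labeling(
    points: List[Vector],
    k: int,
    informativeness: Callable[[int], float],
    per_cluster: int = 1,
) -> List[int]:
    """Return indices of the highest-scored instances in each cluster.

    `informativeness` is a hypothetical placeholder for an external LLM's
    judgment of how useful an instance would be for training.
    """
    labels = kmeans(points, k)
    chosen: List[int] = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=informativeness, reverse=True)
        chosen.extend(members[:per_cluster])
    return sorted(chosen)


# Toy data: two well-separated groups of (assumed) instance embeddings.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
# Made-up scores standing in for LLM informativeness judgments.
scores = [0.2, 0.9, 0.4, 0.7]
selected = select_for_labeling(points, k=2, informativeness=lambda i: scores[i])
print(selected)
```

Selecting per cluster rather than globally keeps the chosen instances spread across different kinds of situations, which is the usual motivation for combining clustering with an informativeness score.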

🔎 Similar Papers

GOMA: Proactive Embodied Cooperative Communication via Goal-Oriented Mental Alignment