NatSGLD: A Dataset with Speech, Gesture, Logic, and Demonstration for Robot Learning in Natural Human-Robot Interaction

πŸ“… 2025-02-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing HRI datasets predominantly focus on simple tasks (e.g., object pointing), limiting their utility for complex instruction understanding and interpretable decision-making. To address this, the authors introduce NatSGLD, a multimodal robot learning dataset designed for natural human–robot interaction and, they argue, the first to embed Linear Temporal Logic (LTL) formulas as ground-truth task specifications directly in real-world interaction data. Each record pairs synchronized speech and gestures with an LTL specification and a demonstration trajectory. A Wizard-of-Oz paradigm keeps the interaction natural while preserving precise multimodal synchronization and semantic alignment between commands and annotations. The dataset and code are publicly released under the MIT License. NatSGLD is positioned as a resource for multimodal instruction following, plan recognition, and human-advisable, logic-guided reinforcement learning from demonstrations, with the LTL annotations intended to make learned behavior verifiable and interpretable.
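The summary describes each record as pairing speech, gestures, an LTL specification, and a demonstration trajectory. As a rough sketch only, one such record might be represented as below; the field names, shapes, and the example command are assumptions for illustration, not the dataset's actual schema.

```python
# A minimal sketch of what one NatSGLD record could look like in code.
# Field names and shapes are assumptions; the released schema may differ.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class NatSGLDRecord:
    speech_transcript: str                  # natural-language command
    gesture_keypoints: List[List[float]]    # per-frame body/hand keypoints
    ltl_formula: str                        # ground-truth LTL task specification
    demonstration: List[Tuple[float, ...]]  # demonstration trajectory samples

# A composite instruction paired with one possible LTL interpretation
# (proposition names such as washed_vegetables are hypothetical).
example = NatSGLDRecord(
    speech_transcript="Wash the vegetables, then put them in the pot.",
    gesture_keypoints=[[0.0] * 34],         # placeholder pose frame
    ltl_formula="F (washed_vegetables & F in_pot_vegetables)",
    demonstration=[(0.0, 0.0, 0.0)],        # placeholder (t, x, y) samples
)
print(example.ltl_formula)
```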

πŸ“ Abstract
Recent advances in multimodal Human-Robot Interaction (HRI) datasets emphasize the integration of speech and gestures, allowing robots to absorb explicit knowledge and tacit understanding. However, existing datasets primarily focus on elementary tasks like object pointing and pushing, limiting their applicability to complex domains. They prioritize simpler human command data but place less emphasis on training robots to correctly interpret tasks and respond appropriately. To address these gaps, we present the NatSGLD dataset, which was collected using a Wizard of Oz (WoZ) method, where participants interacted with a robot they believed to be autonomous. NatSGLD records humans' multimodal commands (speech and gestures), each paired with a demonstration trajectory and a Linear Temporal Logic (LTL) formula that provides a ground-truth interpretation of the commanded tasks. This dataset serves as a foundational resource for research at the intersection of HRI and machine learning. By providing multimodal inputs and detailed annotations, NatSGLD enables exploration in areas such as multimodal instruction following, plan recognition, and human-advisable reinforcement learning from demonstrations. We release the dataset and code under the MIT License at https://www.snehesh.com/natsgld/ to support future HRI research.
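Because each command is annotated with a ground-truth LTL formula, satisfaction of the commanded task can in principle be checked mechanically against a demonstration trace. The sketch below is an illustrative finite-trace evaluator for a small LTL fragment (atoms, conjunction, F, G, U), not the paper's tooling; the proposition names are hypothetical.

```python
# A minimal sketch (not the paper's implementation) of evaluating a small LTL
# fragment over a finite demonstration trace, where each step is the set of
# atomic propositions that hold at that step.
from typing import List, Set, Tuple, Union

Formula = Union[str, Tuple]  # atom, or ("AND"/"F"/"G"/"U", subformulas...)

def holds(phi: Formula, trace: List[Set[str]], i: int = 0) -> bool:
    """Evaluate formula phi on the suffix of `trace` starting at index i."""
    if isinstance(phi, str):                      # atomic proposition
        return phi in trace[i]
    op = phi[0]
    if op == "AND":
        return holds(phi[1], trace, i) and holds(phi[2], trace, i)
    if op == "F":                                 # eventually
        return any(holds(phi[1], trace, j) for j in range(i, len(trace)))
    if op == "G":                                 # always (on the finite suffix)
        return all(holds(phi[1], trace, j) for j in range(i, len(trace)))
    if op == "U":                                 # phi[1] until phi[2]
        return any(
            holds(phi[2], trace, j)
            and all(holds(phi[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")

# "Eventually the vegetables are washed, and afterwards eventually in the pot."
spec = ("F", ("AND", "washed_vegetables", ("F", "in_pot_vegetables")))
trace = [set(), {"washed_vegetables"}, {"washed_vegetables", "in_pot_vegetables"}]
print(holds(spec, trace))  # True: the demonstration satisfies the specification
```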
Problem

Research questions and friction points this paper is trying to address.

Existing multimodal HRI datasets cover only elementary tasks
Robots must correctly interpret and respond to complex commanded tasks
Limited resources at the intersection of HRI and machine learning research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal HRI dataset pairing speech, gesture, LTL, and demonstrations
Wizard of Oz data collection for natural interaction
Ground-truth Linear Temporal Logic task annotations
πŸ”Ž Similar Papers
No similar papers found.