NMM-HRI: Natural Multi-modal Human-Robot Interaction with Voice and Deictic Posture via Large Language Model

📅 2025-01-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of natural interaction and high cognitive load for older adults when using service robots, this paper proposes a voice–pointing gesture multimodal interaction system tailored for elderly users. Methodologically, we introduce the first joint modeling framework integrating speech with depth-sensing pointing gestures (finger/arm orientation); incorporate a grammar-constrained mechanism to mitigate large language model hallucinations and ensure command safety; and integrate YOLOv8-based object detection, depth-map-driven bounding box estimation, fine-tuned Qwen2 language modeling, temporally aligned multimodal fusion, and structured action decoding. Evaluated on a UR3e robotic platform, our system achieves a 37.2% improvement in task accuracy and demonstrates significantly enhanced robustness over unimodal baselines. All source code and design documentation are publicly released.

📝 Abstract
Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address this challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by an object detection model to gain a global understanding of the environment, and bounding boxes are then estimated based on depth information. By feeding voice-to-text commands and the temporally aligned selected bounding boxes into a large language model (LLM), robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks of varying complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better HRI performance in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.
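The pipeline the abstract describes, selecting detections temporally aligned with a pointing gesture and rejecting LLM output that violates a fixed control syntax, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Detection` structure, the 0.5 s alignment window, and the `PICK`/`PLACE`/`MOVE_TO`/`RELEASE` grammar are all assumed for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    bbox: tuple       # (x, y, w, h) in pixels, from the detector + depth map
    timestamp: float  # seconds, from the camera stream

# Hypothetical action grammar: the paper constrains LLM output with key
# control syntax rules; one plausible shape is semicolon-separated
# VERB(object) commands drawn from a fixed verb set.
ACTION_GRAMMAR = re.compile(
    r"^(PICK|PLACE|MOVE_TO|RELEASE)\([a-z_]+\)"
    r"(;(PICK|PLACE|MOVE_TO|RELEASE)\([a-z_]+\))*$"
)

def align_pointing(detections, point_time, window=0.5):
    """Keep only detections whose timestamps fall within `window` seconds
    of the deictic gesture, mimicking temporal alignment of modalities."""
    return [d for d in detections if abs(d.timestamp - point_time) <= window]

def validate_plan(plan: str) -> bool:
    """Reject any generated action sequence that does not match the
    grammar, guarding against hallucinated or unsafe commands."""
    return bool(ACTION_GRAMMAR.match(plan))
```

For example, with a gesture at t = 1.2 s, a cup detected at t = 1.0 s would be selected while a book detected at t = 3.0 s would not, and `validate_plan("PICK(cup);PLACE(table)")` passes while an out-of-grammar command such as `"DESTROY(cup)"` is rejected.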
Problem

Research questions and friction points this paper is trying to address.

Natural Language Understanding
Gesture Recognition
Human-Robot Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

NMM-HRI
Natural Language Processing
Visual Object Recognition
Yuzhi Lai
Reutlingen University, Alteburgstraße 150, 72762 Reutlingen, Germany
Shenghai Yuan
Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
Youssef Nassar
Research Associate, Reutlingen University
Deep Learning · Computer Vision · Robotics
Mingyu Fan
Donghua University, 849 Zhongshan West Street, Shanghai 200051
Atmaraaj Gopal
Neura Robotics GmbH, 44 Gutenbergstraße, Metzingen 72555
Arihiro Yorita
Kwansei Gakuin University, 1-155 Uegahara 1bancho, Hyogo 662-8501
Naoyuki Kubota
Tokyo Metropolitan University
Robotics · Computational Intelligence
Matthias Rätsch
Reutlingen University, Alteburgstraße 150, 72762 Reutlingen, Germany