NMM-HRI: Natural Multi-modal Human-Robot Interaction with Voice and Deictic Posture via Large Language Model

📅 2025-01-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of natural interaction and high cognitive load for older adults when using service robots, this paper proposes a voice–pointing gesture multimodal interaction system tailored for elderly users. Methodologically, we introduce the first joint modeling framework integrating speech with depth-sensing pointing gestures (finger/arm orientation); incorporate a grammar-constrained mechanism to mitigate large language model hallucinations and ensure command safety; and integrate YOLOv8-based object detection, depth-map-driven bounding box estimation, fine-tuned Qwen2 language modeling, temporally aligned multimodal fusion, and structured action decoding. Evaluated on a UR3e robotic platform, our system achieves a 37.2% improvement in task accuracy and demonstrates significantly enhanced robustness over unimodal baselines. All source code and design documentation are publicly released.

📝 Abstract
Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address this challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by an object detection model to gain a global understanding of the environment, and bounding boxes are then estimated based on depth information. By feeding voice-to-text commands and the temporally aligned selected bounding boxes into a large language model (LLM), robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks of varying complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better HRI performance in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.
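The pipeline the abstract describes, selecting detections temporally aligned with a pointing gesture and rejecting LLM output that violates a fixed control syntax, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Detection` structure, the 0.5 s alignment window, and the `PICK`/`PLACE`/`MOVE_TO`/`RELEASE` grammar are all assumed for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    bbox: tuple       # (x, y, w, h) in pixels, from the detector + depth map
    timestamp: float  # seconds, from the camera stream

# Hypothetical action grammar: the paper constrains LLM output with key
# control syntax rules; one plausible shape is semicolon-separated
# VERB(object) commands drawn from a fixed verb set.
ACTION_GRAMMAR = re.compile(
    r"^(PICK|PLACE|MOVE_TO|RELEASE)\([a-z_]+\)"
    r"(;(PICK|PLACE|MOVE_TO|RELEASE)\([a-z_]+\))*$"
)

def align_pointing(detections, point_time, window=0.5):
    """Keep only detections whose timestamps fall within `window` seconds
    of the deictic gesture, mimicking temporal alignment of modalities."""
    return [d for d in detections if abs(d.timestamp - point_time) <= window]

def validate_plan(plan: str) -> bool:
    """Reject any generated action sequence that does not match the
    grammar, guarding against hallucinated or unsafe commands."""
    return bool(ACTION_GRAMMAR.match(plan))
```

For example, with a gesture at t = 1.2 s, a cup detected at t = 1.0 s would be selected while a book detected at t = 3.0 s would not, and `validate_plan("PICK(cup);PLACE(table)")` passes while an out-of-grammar command such as `"DESTROY(cup)"` is rejected.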
Problem

Research questions and friction points this paper is trying to address.

Natural Language Understanding
Gesture Recognition
Human-Robot Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

NMM-HRI
Natural Language Processing
Visual Object Recognition
Yuzhi Lai
Reutlingen University, Alteburgstraße 150, 72762 Reutlingen, Germany
Shenghai Yuan
Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
Youssef Nassar
Research Associate, Reutlingen University
Deep Learning · Computer Vision · Robotics
Mingyu Fan
Donghua University, 849 Zhongshan West Street, Shanghai 200051
Atmaraaj Gopal
Neura Robotics GmbH, 44 Gutenbergstraße, Metzingen 72555
Arihiro Yorita
Kwansei Gakuin University, 1-155 Uegahara 1bancho, Hyogo 662-8501
Naoyuki Kubota
Tokyo Metropolitan University
Robotics · Computational Intelligence
Matthias Rätsch
Reutlingen University, Alteburgstraße 150, 72762 Reutlingen, Germany