NVP-HRI: Zero shot natural voice and posture-based human-robot interaction via large language model

📅 2025-03-12
🏛️ Expert systems with applications
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing HRI systems rely on predefined gestures or linguistic tokens, which limits generalization to novel objects and poses usability challenges, particularly for elderly users. This paper proposes a zero-shot, multimodal human-robot interaction framework that fuses speech and human-pose understanding in real time and executes the resulting commands without any task-specific training. The approach integrates a multimodal large language model (MLLM), joint cross-modal embedding, zero-shot prompt engineering, and a lightweight real-time pose estimator (MediaPipe + Transformer) to achieve LLM-driven intent decoding and ROS2-based robot control. Key contributions include (1) the first zero-shot speech-pose dual-modality alignment mechanism, which eliminates reliance on supervised fine-tuning, and (2) end-to-end zero-shot multimodal grounding. Evaluated across five daily tasks, the system achieves a mean accuracy of 92.3% with end-to-end latency under 480 ms, demonstrating strong cross-task generalization to unseen instruction-pose combinations.
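As a rough illustration of how such a pipeline can be wired together, the Python sketch below estimates a deictic pointing ray from MediaPipe pose landmarks and fuses it with a speech transcript into a zero-shot prompt for an LLM intent decoder. The forearm-ray heuristic, the function names, and the prompt wording are our illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch (not the authors' code): estimate a pointing ray with
# MediaPipe, then fuse it with a speech transcript into a zero-shot prompt.
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def pointing_ray(rgb_frame):
    """Estimate a 2D deictic ray (origin, unit direction) from the right
    forearm, in normalized image coordinates; returns None if no person."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        result = pose.process(rgb_frame)
    if result.pose_landmarks is None:
        return None
    lm = result.pose_landmarks.landmark
    elbow = lm[mp_pose.PoseLandmark.RIGHT_ELBOW]
    wrist = lm[mp_pose.PoseLandmark.RIGHT_WRIST]
    origin = np.array([wrist.x, wrist.y])
    direction = origin - np.array([elbow.x, elbow.y])
    return origin, direction / (np.linalg.norm(direction) + 1e-9)

def build_intent_prompt(transcript, scene_objects, pointed_object):
    """Zero-shot prompt: no task-specific examples, only the fused context;
    the downstream LLM is asked for a single structured action."""
    return (
        "You control a robot manipulator. Visible objects: "
        + ", ".join(scene_objects)
        + f". The user is pointing at the {pointed_object} and says: "
        + f'"{transcript}". Respond with exactly one JSON object of the form '
        + '{"action": <verb>, "target": <object>}.'
    )
```

A structured JSON reply like this is straightforward to parse into a ROS2 action or topic message, which matches the summary's LLM-to-ROS2 control flow, though the exact message schema used in the paper is not specified here.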

Problem

Research questions and friction points this paper is trying to address.

Enables zero-shot interaction with new objects using voice and posture.
Reduces reliance on predefined gestures and language tokens for HRI.
Improves efficiency in human-robot interaction for diverse real-world tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines voice commands with deictic posture.
Uses the Segment Anything Model for object representation (see the sketch after this list).
Integrates a large language model to interpret multimodal commands.
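To make the object-grounding idea concrete, here is a minimal sketch that pairs SAM's automatic mask generation with the pointing ray from the previous snippet. The centroid-to-ray scoring rule is our own simplification and the checkpoint path is assumed, so treat this as an illustration rather than the paper's grounding algorithm.

```python
# Illustrative sketch: ground the deictic gesture by picking the SAM mask
# whose centroid best aligns with the pointing ray (normalized coordinates).
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def select_pointed_object(rgb_frame, origin, direction,
                          checkpoint="sam_vit_b_01ec64.pth"):
    """Segment the scene with SAM, then return the mask whose centroid lies
    closest to the pointing ray origin + t * direction (t >= 0)."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(rgb_frame)
    h, w = rgb_frame.shape[:2]

    def ray_distance(mask):
        x, y, bw, bh = mask["bbox"]                        # XYWH, in pixels
        centroid = np.array([(x + bw / 2) / w, (y + bh / 2) / h])
        to_c = centroid - origin
        t = max(float(np.dot(to_c, direction)), 0.0)       # project onto ray
        return float(np.linalg.norm(to_c - t * direction))

    return min(masks, key=ray_distance)  # dict with "segmentation", "bbox", ...
```

Because SAM is class-agnostic, this selection step needs no object-specific training, which is consistent with the zero-shot claim: novel objects are represented by their masks rather than by predefined labels.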
Yuzhi Lai
Reutlingen University, Alteburgstraße 150, Reutlingen, 72762, Germany
Shenghai Yuan
Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
Youssef Nassar
Research Associate, Reutlingen University
Deep Learning, Computer Vision, Robotics
Mingyu Fan
Donghua University, 849 Zhongshan West Street 9, Shanghai, 200051, China
Thomas Weber
LMU Munich
Human Computer Interaction, Software Engineering, Human-Centered AI, Human-AI Co-Creation
Matthias Rätsch
Reutlingen University
Human Robot Collaboration, Face Analysis, Computer Vision, Machine Learning, Artificial Intelligence