🤖 AI Summary
Existing HRI systems rely on predefined gestures or linguistic tokens, which limits generalization to novel objects and poses usability challenges, particularly for elderly users. This paper proposes a zero-shot multimodal human–robot interaction framework that fuses speech and human pose understanding for real-time, natural command execution without any task-specific training. The approach integrates a multimodal large language model (MLLM), a joint cross-modal embedding, zero-shot prompt engineering, and a lightweight real-time pose estimator (MediaPipe + Transformer) to achieve LLM-driven intent decoding and ROS2-based robot control. Key contributions include: (1) the first zero-shot speech–pose dual-modality alignment mechanism, eliminating reliance on supervised fine-tuning; and (2) end-to-end zero-shot multimodal grounding. Evaluated across five daily tasks, the system achieves a mean accuracy of 92.3% and an end-to-end latency under 480 ms, demonstrating strong cross-task generalization to unseen instruction–pose combinations.
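
To make the described pipeline concrete, below is a minimal sketch of how such a zero-shot speech+pose system could be wired together. It assumes MediaPipe for pose landmarks, a generic chat-style MLLM endpoint (the `llm_chat` callable is a hypothetical stand-in), and rclpy for ROS2 publishing; the prompt wording, topic name, and intent schema are illustrative, not the paper's actual interfaces.

```python
# Hypothetical sketch of the zero-shot speech + pose intent pipeline.
# The prompt format, topic name, and `llm_chat` helper are assumptions,
# not the authors' implementation.

import json

import cv2
import mediapipe as mp
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


def extract_pose_json(frame_bgr):
    """Run MediaPipe Pose on one BGR frame; return landmarks as compact JSON."""
    with mp.solutions.pose.Pose(static_image_mode=True) as pose:
        results = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return "[]"
    landmarks = [
        {"i": i, "x": round(lm.x, 3), "y": round(lm.y, 3), "z": round(lm.z, 3)}
        for i, lm in enumerate(results.pose_landmarks.landmark)
    ]
    return json.dumps(landmarks)


def build_prompt(transcript, pose_json):
    """Zero-shot prompt fusing the speech transcript with serialized pose landmarks."""
    return (
        "You are a robot intent decoder. Given a spoken instruction and the "
        "user's body pose (MediaPipe landmarks, normalized coordinates), "
        'output a JSON action: {"action": ..., "target": ...}.\n'
        f"Speech: {transcript}\n"
        f"Pose landmarks: {pose_json}\n"
        "Action:"
    )


class IntentPublisher(Node):
    """Minimal ROS2 node that publishes decoded intents as JSON strings."""

    def __init__(self):
        super().__init__("zero_shot_intent_node")
        self.pub = self.create_publisher(String, "robot_intent", 10)

    def publish_intent(self, intent_json):
        msg = String()
        msg.data = intent_json
        self.pub.publish(msg)


def run_once(frame_bgr, transcript, llm_chat):
    # `llm_chat` is any callable (prompt -> str) wrapping the MLLM endpoint.
    rclpy.init()
    node = IntentPublisher()
    pose_json = extract_pose_json(frame_bgr)
    intent = llm_chat(build_prompt(transcript, pose_json))
    node.publish_intent(intent)
    rclpy.shutdown()
```

Because the MLLM receives the pose as plain serialized landmarks inside a zero-shot prompt, no supervised fine-tuning or gesture vocabulary is needed, which matches the paper's claim of generalizing to unseen instruction–pose combinations.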