🤖 AI Summary
Existing human-robot interaction (HRI) systems rely on single-modal inputs—such as gesture or speech—leading to high ambiguity, low efficiency, and poor accessibility for users with motor impairments. To address these limitations, this work proposes a lightweight gaze-speech bimodal interaction framework tailored for users with physical mobility constraints. Leveraging Meta ARIA smart glasses, the system captures real-time eye-tracking and speech signals, integrating temporal eye-movement modeling, adaptive gaze-duration estimation, vision-language alignment, and large language model (LLM)-driven intent parsing with contextual scene injection—all executed on-device. This establishes the first on-device LLM-powered real-time HRI paradigm, effectively suppressing ocular noise. Evaluation demonstrates >92% task success rate and sub-2.1-second average interaction latency. The framework exhibits exceptional robustness and usability in trials with physically impaired users.
📝 Abstract
Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gestures or language commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM-HRI, an efficient multi-modal framework for human-robot interaction that integrates language and gaze inputs via foundation models. By leveraging lightweight Meta ARIA glasses, our system captures real-time multi-modal signals and utilizes large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines gaze fixation time interval, reducing noise caused by the gaze dynamic nature. Experimental evaluations demonstrate that FAM-HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments.