🤖 AI Summary
This work addresses human-robot collaborative object classification, tackling the challenge of joint contextual reasoning over unstructured nonverbal cues—including gestures, postures, facial expressions, speech, and environmental states—for social robots.
Method: We propose the first LLM-driven multimodal hierarchical intent prediction framework, integrating multimodal perception modules (e.g., OpenPose, MediaPipe) with large language models (GPT-4, Claude, Llama) into a layered intent decoding architecture that enables end-to-end mapping from low-level behavioral representations to high-level semantic intents.
Contribution/Results: Experiments in real-world robotic settings demonstrate that all five evaluated LLMs significantly outperform baselines, achieving a 37% improvement in cross-modal contextual understanding accuracy. This study provides the first systematic empirical validation of LLMs as general-purpose intent engines—demonstrating their effectiveness and generalizability in natural human-robot collaboration.
📝 Abstract
Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI