Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

📅 2024-04-12
🏛️ ICSR + AI
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses human-robot collaborative object classification, tackling the challenge of joint contextual reasoning over unstructured nonverbal cues—including gestures, postures, facial expressions, speech, and environmental states—for social robots. Method: We propose the first LLM-driven multimodal hierarchical intent prediction framework, integrating multimodal perception modules (e.g., OpenPose, MediaPipe) with large language models (GPT-4, Claude, Llama) into a layered intent decoding architecture that enables end-to-end mapping from low-level behavioral representations to high-level semantic intents. Contribution/Results: Experiments in real-world robotic settings demonstrate that all five evaluated LLMs significantly outperform baselines, achieving a 37% improvement in cross-modal contextual understanding accuracy. This study provides the first systematic empirical validation of LLMs as general-purpose intent engines—demonstrating their effectiveness and generalizability in natural human-robot collaboration.

Technology Category

Application Category

📝 Abstract
Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI
Problem

Research questions and friction points this paper is trying to address.

Predicting human intentions using LLMs in robot collaboration
Integrating verbal and non-verbal cues for intention inference
Evaluating LLMs' reasoning on multimodal cues in social robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs predict intentions via multimodal cues
Hierarchical fusion of verbal and non-verbal data
Leverages context-understanding for real-time collaboration
H
Hassan Ali
Knowledge Technology Group, Department of Informatics, University of Hamburg, Germany
Philipp Allgeuer
Philipp Allgeuer
University of Hamburg
Humanoid RoboticsDeep Learning
S
Stefan Wermter
Knowledge Technology Group, Department of Informatics, University of Hamburg, Germany