Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

📅 2024-04-12

🏛️ ICSR + AI

📈 Citations: 1

✨ Influential: 0

career value

222K/year

🤖 AI Summary

This work addresses human-robot collaborative object classification, tackling the challenge of joint contextual reasoning over unstructured nonverbal cues—including gestures, postures, facial expressions, speech, and environmental states—for social robots. Method: We propose the first LLM-driven multimodal hierarchical intent prediction framework, integrating multimodal perception modules (e.g., OpenPose, MediaPipe) with large language models (GPT-4, Claude, Llama) into a layered intent decoding architecture that enables end-to-end mapping from low-level behavioral representations to high-level semantic intents. Contribution/Results: Experiments in real-world robotic settings demonstrate that all five evaluated LLMs significantly outperform baselines, achieving a 37% improvement in cross-modal contextual understanding accuracy. This study provides the first systematic empirical validation of LLMs as general-purpose intent engines—demonstrating their effectiveness and generalizability in natural human-robot collaboration.

Technology Category

Application Category

📝 Abstract

Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI

Problem

Research questions and friction points this paper is trying to address.

Predicting human intentions using LLMs in robot collaboration

Integrating verbal and non-verbal cues for intention inference

Evaluating LLMs' reasoning on multimodal cues in social robots

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs predict intentions via multimodal cues

Hierarchical fusion of verbal and non-verbal data

Leverages context-understanding for real-time collaboration

🔎 Similar Papers

Fine-tuning Multimodal Large Language Models for Product Bundling