Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

📅 2024-03-18
🏛️ arXiv.org
📈 Citations: 1
Influential: 0

🤖 AI Summary
This work extends an earlier framework for natural-language-driven robot task execution so that people can interact with autonomous agents through both spoken and typed conversation. The framework combines a pre-trained large language model (LLM), a vision-language model (VLM), and a speech recognition (SR) model to decode high-level instructions, relate them to a semantic understanding of the robot's task environment, and abstract them into actionable robot commands or queries. Spoken-instruction understanding was evaluated quantitatively with participants from different racial backgrounds and with different English accents. Based on the logged interaction data, the system achieved 87.55% vocal command decoding accuracy, an 86.27% command execution success rate, and an average latency of 0.89 seconds from receiving a vocal command to initiating the robot's physical action.

📝 Abstract
In this paper, we extended the method proposed in [21] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode high-level natural language conversations and the semantic understanding of the robot's task environment, and to abstract them into the robot's actionable commands or queries. We performed a quantitative evaluation of our framework's understanding of natural vocal conversation with participants from different racial backgrounds and with different English accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved a vocal command decoding accuracy of 87.55%, a command execution success rate of 86.27%, and an average latency of 0.89 seconds from receiving the participants' vocal chat commands to initiating the robot's actual physical action. The video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/.
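The three figures reported in the abstract are described as being derived from logged interaction data. The sketch below shows one plausible way such figures could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the `Interaction` record, its field names, the `summarize` helper, and the choice of denominators (vocal commands only for decoding accuracy and latency, all commands for execution success) are hypothetical, and the demo records are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    """One logged command; every field name here is hypothetical."""
    spoken: bool              # True for a vocal command, False for typed text
    decoded_correctly: bool   # did the decoded command match the user's intent?
    executed: bool            # did the robot complete the requested action?
    received_s: float         # time the command was received, in seconds
    action_start_s: float     # time the robot began its physical action, in seconds


def summarize(log):
    """Return the three headline metrics for a list of Interaction records."""
    vocal = [i for i in log if i.spoken]
    if not log or not vocal:
        raise ValueError("log needs at least one vocal command")
    return {
        # Assumed definition: share of vocal commands decoded correctly.
        "decoding_accuracy_pct": 100 * sum(i.decoded_correctly for i in vocal) / len(vocal),
        # Assumed definition: share of all commands (spoken or typed) executed.
        "execution_success_pct": 100 * sum(i.executed for i in log) / len(log),
        # Assumed definition: mean delay from vocal command receipt to action start.
        "mean_latency_s": mean(i.action_start_s - i.received_s for i in vocal),
    }


# Made-up records for illustration only; not data from the paper.
demo_log = [
    Interaction(True, True, True, 0.0, 1.0),
    Interaction(True, False, False, 5.0, 5.5),
    Interaction(False, True, True, 9.0, 9.25),
    Interaction(True, True, True, 12.0, 13.5),
]
metrics = summarize(demo_log)
print(metrics)
```

Keeping the per-command record separate from the aggregation makes it easy to swap in a different definition, for example measuring execution success only over correctly decoded commands, if the paper's own protocol differs from the assumptions made here.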
Problem

Research questions and friction points this paper is trying to address.

Human-Robot Interaction
Natural Language Processing
Task Execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained Language Models
Visual Language Models
Speech Recognition