🤖 AI Summary
To address robustness challenges in speech command recognition for human-robot collaboration, particularly under background noise, accent variation, and environmental interference, this paper proposes SIFToM, the first Theory of Mind (ToM)-inspired speech command following framework. The method models human goals and shared intentions as cognitive priors, integrating multimodal scene representations, goal-directed Bayesian inference, and end-to-end speech-to-action mapping to enable top-down, cognition-guided understanding. Evaluated in the VirtualHome 2 simulation environment, the approach significantly outperforms existing speech and language models, achieving command-following accuracy that approaches human performance. It is further validated on a real-world mobile manipulator, where it successfully executes multi-step breakfast preparation tasks. The core contribution lies in integrating ToM principles into speech command following, establishing a paradigm for robust, interpretable, and goal-driven spoken human-robot interaction.
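As a hedged sketch of what the goal-directed Bayesian inference could look like (this decomposition is our reading of the summary, not notation taken from the paper), the robot can select the instruction $\hat{u}$ that best explains both the acoustic signal $s$ and the inferred human goal $g$:

$$
\hat{u} = \arg\max_{u} P(u \mid s, g), \qquad P(u \mid s, g) \propto P(s \mid u)\, P(u \mid g),
$$

where $P(s \mid u)$ is a bottom-up speech likelihood (e.g., from an ASR model) and $P(u \mid g)$ is the top-down prior that the inferred goal or joint plan places on instruction $u$.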
📝 Abstract
Spoken language instructions are ubiquitous in agent collaboration. In human-robot collaboration, however, recognition accuracy for human speech is often degraded by speech and environmental factors such as background noise, speaker accents, and mispronunciation. When faced with noisy or unfamiliar auditory inputs, humans use context and prior knowledge to disambiguate the stimulus and take pragmatic actions, a process referred to in cognitive science as top-down processing. We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), that enables robots to pragmatically follow human instructions under diverse speech conditions by inferring the human's goal and joint plan as a prior for speech perception and understanding. We test SIFToM in simulated home experiments (VirtualHome 2). Results show that SIFToM outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks. We then demonstrate its capability at the task planning level on a mobile manipulator performing breakfast preparation tasks.
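As a minimal, hedged sketch of this top-down idea (all function and variable names below are our own illustrative assumptions, not the authors' implementation or API), one way to use an inferred goal as a prior is to rescore ASR n-best hypotheses, so that a goal-consistent command can beat a phonetically closer but goal-irrelevant one:

```python
import math

# Hedged sketch of top-down, goal-conditioned instruction decoding:
# combine a bottom-up speech likelihood with a prior from the inferred
# human goal. All names here are illustrative, not the authors' API.

def rescore_hypotheses(asr_nbest, goal_log_prior, alpha=1.0):
    """Select the instruction that best explains the audio and the goal.

    asr_nbest      : list of (instruction, log P(speech | instruction)) pairs,
                     e.g. an ASR model's n-best output.
    goal_log_prior : dict mapping instruction -> log P(instruction | goal),
                     e.g. derived from an inferred joint task plan.
    alpha          : weight on the goal prior; alpha = 0 recovers plain
                     bottom-up ASR decoding.
    """
    floor = math.log(1e-9)  # small floor for instructions the plan never predicts
    best, best_score = None, -math.inf
    for instruction, log_lik in asr_nbest:
        log_prior = goal_log_prior.get(instruction, floor)
        score = log_lik + alpha * log_prior  # log P(s|u) + alpha * log P(u|g)
        if score > best_score:
            best, best_score = instruction, score
    return best, best_score


if __name__ == "__main__":
    # Noisy audio: the ASR slightly prefers a phonetically close but
    # goal-irrelevant hypothesis; the breakfast-prep prior flips the choice.
    nbest = [("grab the bread", -2.1), ("grab the red", -1.9)]
    prior = {"grab the bread": math.log(0.6), "grab the red": math.log(0.01)}
    print(rescore_hypotheses(nbest, prior))  # ('grab the bread', ...)
```

Because `alpha` scales the goal prior, setting it to zero isolates the bottom-up ASR baseline, which makes the contribution of the top-down prior straightforward to ablate in experiments like those described above.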