🤖 AI Summary
To address robustness challenges in speech command recognition for human-robot collaboration, particularly under background noise, accent variation, and environmental interference, this paper proposes SIFToM, the first Theory of Mind (ToM)-inspired speech command following framework. The method models human goals and shared intentions as cognitive priors, integrating multimodal scene representations, goal-directed Bayesian inference, and end-to-end speech-to-action mapping to enable top-down, cognition-guided understanding. Evaluated in the VirtualHome 2 simulation environment, the approach significantly outperforms existing speech and language models, achieving command-following accuracy that approaches human performance. It is further validated on a real-world mobile manipulator, where it successfully executes multi-step breakfast preparation tasks. The core contribution lies in integrating ToM principles into speech command following, establishing a paradigm for robust, interpretable, and goal-driven spoken human-robot interaction.
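As a hedged sketch of what the goal-directed Bayesian inference could look like (this decomposition is our reading of the summary, not notation taken from the paper), the robot can select the instruction $\hat{u}$ that best explains both the acoustic signal $s$ and the inferred human goal $g$:

$$
\hat{u} = \arg\max_{u} P(u \mid s, g), \qquad P(u \mid s, g) \propto P(s \mid u)\, P(u \mid g),
$$

where $P(s \mid u)$ is a bottom-up speech likelihood (e.g., from an ASR model) and $P(u \mid g)$ is the top-down prior that the inferred goal or joint plan places on instruction $u$.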
📝 Abstract
Spoken language instructions are ubiquitous in agent collaboration. In human-robot collaboration, however, recognition accuracy for human speech is often degraded by speech and environmental factors such as background noise, speaker accents, and mispronunciation. When faced with noisy or unfamiliar auditory inputs, humans use context and prior knowledge to disambiguate the stimulus and take pragmatic actions, a process referred to in cognitive science as top-down processing. We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), that enables robots to pragmatically follow human instructions under diverse speech conditions by inferring the human's goal and joint plan as a prior for speech perception and understanding. We test SIFToM in simulated home experiments (VirtualHome 2). Results show that SIFToM outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks. We then demonstrate its capability at the task planning level on a mobile manipulator performing breakfast preparation tasks.
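As a minimal, hedged sketch of this top-down idea (all function and variable names below are our own illustrative assumptions, not the authors' implementation or API), one way to use an inferred goal as a prior is to rescore ASR n-best hypotheses, so that a goal-consistent command can beat a phonetically closer but goal-irrelevant one:

```python
import math

# Hedged sketch of top-down, goal-conditioned instruction decoding:
# combine a bottom-up speech likelihood with a prior from the inferred
# human goal. All names here are illustrative, not the authors' API.

def rescore_hypotheses(asr_nbest, goal_log_prior, alpha=1.0):
    """Select the instruction that best explains the audio and the goal.

    asr_nbest      : list of (instruction, log P(speech | instruction)) pairs,
                     e.g. an ASR model's n-best output.
    goal_log_prior : dict mapping instruction -> log P(instruction | goal),
                     e.g. derived from an inferred joint task plan.
    alpha          : weight on the goal prior; alpha = 0 recovers plain
                     bottom-up ASR decoding.
    """
    floor = math.log(1e-9)  # small floor for instructions the plan never predicts
    best, best_score = None, -math.inf
    for instruction, log_lik in asr_nbest:
        log_prior = goal_log_prior.get(instruction, floor)
        score = log_lik + alpha * log_prior  # log P(s|u) + alpha * log P(u|g)
        if score > best_score:
            best, best_score = instruction, score
    return best, best_score


if __name__ == "__main__":
    # Noisy audio: the ASR slightly prefers a phonetically close but
    # goal-irrelevant hypothesis; the breakfast-prep prior flips the choice.
    nbest = [("grab the bread", -2.1), ("grab the red", -1.9)]
    prior = {"grab the bread": math.log(0.6), "grab the red": math.log(0.01)}
    print(rescore_hypotheses(nbest, prior))  # ('grab the bread', ...)
```

Because `alpha` scales the goal prior, setting it to zero isolates the bottom-up ASR baseline, which makes the contribution of the top-down prior straightforward to ablate in experiments like those described above.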