Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody

📅 2025-06-01

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

To address the ambiguity in spoken command interpretation caused by missing prosodic cues and the limited robustness of conventional ASR-dependent approaches, this paper proposes the first end-to-end prosody-driven spoken command intent parsing framework. It bypasses text transcription entirely, directly modeling prosodic features from raw speech to infer and resolve ambiguities. We introduce the first benchmark dataset of ambiguous spoken commands specifically designed for robotic task execution. Our method innovatively integrates a prosody-aware module with large language model-based in-context learning (ICL) to jointly perform multimodal alignment and task planning. Experiments demonstrate state-of-the-art performance: 95.79% accuracy in referential intent detection and 71.96% accuracy in selecting correct task plans for ambiguous commands—substantially enhancing robotic systems’ robustness and generalization in understanding natural, prosodically rich spoken instructions.

Technology Category

Application Category

📝 Abstract

Enabling robots to accurately interpret and execute spoken language instructions is essential for effective human-robot collaboration. Traditional methods rely on speech recognition to transcribe speech into text, often discarding crucial prosodic cues needed for disambiguating intent. We propose a novel approach that directly leverages speech prosody to infer and resolve instruction intent. Predicted intents are integrated into large language models via in-context learning to disambiguate and select appropriate task plans. Additionally, we present the first ambiguous speech dataset for robotics, designed to advance research in speech disambiguation. Our method achieves 95.79% accuracy in detecting referent intents within an utterance and determines the intended task plan of ambiguous instructions with 71.96% accuracy, demonstrating its potential to significantly improve human-robot communication.

Problem

Research questions and friction points this paper is trying to address.

Improving robot understanding of spoken instructions via prosody

Resolving ambiguous speech intent using prosodic cues

Enhancing human-robot communication with speech disambiguation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages speech prosody for intent disambiguation

Integrates intents into large language models

Introduces ambiguous speech dataset for robotics

🔎 Similar Papers

SIFToM: Robust Spoken Instruction Following through Theory of Mind