🤖 AI Summary
Robots struggle to detect miscommunication in dialogue, largely because users rarely provide clear nonverbal feedback, which diminishes trust and engagement. This study systematically evaluates machine learning models' ability to detect four types of miscommunication events using a dataset of 240 multimodal human–robot dialogues, complemented by a human rater comparison experiment. Overall model performance is only marginally above chance, though the models achieve meaningful accuracy in detecting confusion within emotionally expressive interactions. Human raters perform comparably, confirming the general scarcity of explicit user feedback. The core finding is that this lack of nonverbal feedback constitutes a fundamental bottleneck for miscommunication detection. To address it, we propose the "active feedback elicitation" paradigm, a framework in which dialogue systems proactively solicit and interpret user feedback. This work provides both theoretical foundations and methodological pathways toward self-reflective, self-correcting conversational agents.
📝 Abstract
Detecting miscommunication in human–robot interaction is critical for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversation through both verbal and non-verbal cues, robots face significant challenges in interpreting non-verbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates how effectively machine learning models detect miscommunication in robot dialogue. Using a multimodal dataset of 240 human–robot conversations in which four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users reported whether they perceived an error, enabling an analysis of the models' ability to detect robot mistakes. Even these state-of-the-art models barely exceed random chance in identifying miscommunication, although on a dataset with more expressive emotional content they successfully identify confused states. To explore the underlying cause, we asked human raters to perform the same task; like the models, they identified only around half of the induced miscommunications. These results uncover a fundamental limitation in detecting robot miscommunication in dialogue: even when users perceive an induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This finding can calibrate expectations for computer vision models and can help researchers design better human–robot conversations by deliberately eliciting feedback where needed.