🤖 AI Summary
In human–robot dialogue, poorly timed responses, abrupt pauses, and frequent interruptions make interactions feel unnatural. Method: This paper presents the first zero-shot application of general-purpose, self-supervised turn-taking models, namely TurnGPT and Voice Activity Projection (VAP), to human–robot interaction (HRI). We propose a multi-model framework for real-time turn-taking decisions that combines VAP for end-of-turn prediction, TurnGPT for dialogue rhythm modeling, and a large language model (LLM) on the Furhat robot platform to jointly govern response preparation, proactive turn-taking, and interruption recovery, all without domain-specific fine-tuning. Contribution/Results: In a controlled study with 39 participants, our approach significantly reduced system response latency, decreased user-initiated interruptions by 42.3%, and substantially improved subjective ratings of naturalness and satisfaction, demonstrating that general turn-taking models can be applied to HRI effectively and practically without fine-tuning.
📝 Abstract
Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system against a traditional baseline system in a within-subject study in which 39 adults conversed with the Furhat robot, with a large language model generating the robot's responses autonomously. The results show that participants significantly preferred the proposed system and that it significantly reduced response delays and interruptions.
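To make the contrast with a silence-based baseline concrete, the following is a minimal sketch of how two model outputs might be combined into a turn-taking decision. It is not the authors' implementation: the signal names (`text_completion_prob`, `audio_shift_prob`), the thresholds, and the decision rules are all hypothetical, and no real TurnGPT or VAP API is called. The sketch only assumes that a text-based model yields a probability that the user's utterance is complete and that an audio-based model yields a probability that the user is yielding the turn. Interruption handling is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # keep listening
    PREPARE = "prepare"      # start generating a response in the background
    TAKE_TURN = "take_turn"  # start speaking


@dataclass
class TurnSignals:
    # Hypothetical inputs; a real system would obtain them from its own models.
    text_completion_prob: float  # text-based estimate that the utterance is complete
    audio_shift_prob: float      # audio-based estimate that the user is yielding the turn
    silence_ms: int              # time elapsed since the user stopped speaking


def silence_baseline(signals: TurnSignals, timeout_ms: int = 1000) -> Action:
    """Traditional policy: respond only after a fixed stretch of silence."""
    return Action.TAKE_TURN if signals.silence_ms >= timeout_ms else Action.WAIT


def combined_policy(
    signals: TurnSignals,
    prepare_threshold: float = 0.5,  # assumed value, for illustration only
    take_threshold: float = 0.7,     # assumed value, for illustration only
    fallback_ms: int = 2000,         # assumed safety net when the models stay unsure
) -> Action:
    """Illustrative policy that uses both predictions instead of silence alone."""
    # Safety net: after a long silence, respond regardless of the predictions.
    if signals.silence_ms >= fallback_ms:
        return Action.TAKE_TURN
    # Take the turn early when the user has paused and both signals agree.
    if (
        signals.silence_ms > 0
        and signals.audio_shift_prob >= take_threshold
        and signals.text_completion_prob >= prepare_threshold
    ):
        return Action.TAKE_TURN
    # The utterance looks complete in text: prepare a response, but do not speak yet.
    if signals.text_completion_prob >= prepare_threshold:
        return Action.PREPARE
    return Action.WAIT


if __name__ == "__main__":
    short_pause = TurnSignals(text_completion_prob=0.9, audio_shift_prob=0.8, silence_ms=200)
    print(silence_baseline(short_pause).value, combined_policy(short_pause).value)
```

In this sketch, a 200 ms pause after an utterance that both signals judge complete leads the combined policy to take the turn, while the silence baseline keeps waiting until its timeout. A pause in the middle of an utterance (low `text_completion_prob`) keeps the combined policy waiting, which is the kind of case where a pure silence threshold risks interrupting the user.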