🤖 AI Summary
In real-world settings, noise degrades the accuracy of voice activity detection (VAD), which impairs the naturalness and real-time responsiveness of turn-taking in conversational agents. To address this, we propose a noise-robust, on-device Voice Activity Projection (VAP) model built on a lightweight Transformer architecture, enabling low-latency, high-accuracy local turn prediction. Our work is the first to validate on-device VAP robustness in an open shopping-mall environment, without relying on cloud infrastructure, and it combines multimodal behavioral cues with subjective user evaluations collected via structured questionnaires. Experiments demonstrate a substantial reduction in system response latency. User response speed improves by 37%, turn-transition accuracy reaches 91.2%, and subjective user experience scores increase by 2.4 points (on a 5-point scale). This work establishes a deployable, real-time turn-control paradigm for dialogue systems operating in authentic, acoustically challenging environments.
📝 Abstract
Turn-taking is a crucial aspect of human-robot interaction, directly influencing conversational fluidity and user engagement. While previous research has explored turn-taking models in controlled environments, their robustness in real-world settings remains underexplored. In this study, we propose a noise-robust voice activity projection (VAP) model, based on a Transformer architecture, to enhance real-time turn-taking in dialogue robots. To evaluate the effectiveness of the proposed system, we conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. Our analysis combined subjective user evaluations with objective behavioral measures. The results showed that the proposed system significantly reduced response latency, leading to more natural conversations in which both the robot and the users responded faster. The subjective evaluations suggested that faster responses contribute to a better interaction experience.