🤖 AI Summary
Predictive turn-taking models (PTTMs) degrade sharply under acoustic noise, undermining naturalistic human-robot interaction in real-world settings. This work presents a systematic analysis of PTTM sensitivity to noise: hold/shift accuracy falls from 84% in clean speech to 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM that exploits visual cues, reaching 72% accuracy in 10 dB music noise, 20 percentage points above the audio-only model, and outperforming the audio-only PTTM across all noise types and SNRs, although these gains do not always generalise to unseen noise types. Successful training also depends on accurate transcription, limiting ASR-derived transcriptions to clean conditions. The implementation is publicly available.
📝 Abstract
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal that PTTMs are highly sensitive to noise: hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which incorporates visual features, to reach 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this advantage does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make our code publicly available for future research.
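The noisy-training setup described above hinges on mixing noise (e.g. music) into clean speech at a controlled signal-to-noise ratio such as the paper's 10 dB condition. A minimal sketch of such SNR-controlled mixing, assuming NumPy arrays of raw audio samples (the function name and details are illustrative, not taken from the paper's code):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech at a target SNR in decibels (illustrative sketch)."""
    # Loop or trim the noise so it covers the full speech signal.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    # Choose a scale so that 10 * log10(P_speech / P_scaled_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

A training pipeline could apply this on the fly to clean utterances with randomly sampled noise clips and SNRs, so the model sees the same degradation at train time that it will face when deployed.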