🤖 AI Summary
Predictive turn-taking models (PTTMs) degrade sharply under acoustic noise, undermining naturalistic human-robot interaction in real-world settings. This work presents a systematic analysis of PTTM sensitivity to noise: hold/shift accuracy falls from 84% in clean speech to 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM that exploits visual cues, reaching 72% accuracy in 10 dB music noise, 20 percentage points above the audio-only model, and outperforming the audio-only PTTM across all noise types and SNRs, although these gains do not always generalise to unseen noise types. Successful training also depends on accurate transcription, limiting ASR-derived transcriptions to clean conditions. The implementation is publicly available.
📝 Abstract
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal that PTTMs are highly sensitive to noise: hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which incorporates visual features, to reach 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this advantage does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make our code publicly available for future research.
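The noisy-training setup described above hinges on mixing noise (e.g. music) into clean speech at a controlled signal-to-noise ratio such as the paper's 10 dB condition. A minimal sketch of such SNR-controlled mixing, assuming NumPy arrays of raw audio samples (the function name and details are illustrative, not taken from the paper's code):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech at a target SNR in decibels (illustrative sketch)."""
    # Loop or trim the noise so it covers the full speech signal.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    # Choose a scale so that 10 * log10(P_speech / P_scaled_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

A training pipeline could apply this on the fly to clean utterances with randomly sampled noise clips and SNRs, so the model sees the same degradation at train time that it will face when deployed.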