Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing predictive turn-taking models (PTTMs) rely predominantly on audio signals and neglect visual cues such as facial expressions, head pose, and gaze, which limits naturalness in settings where interlocutors can see one another. To address this, the authors propose MM-VAP, a multimodal PTTM that combines speech with these visual modalities. Rather than aggregating all holds and shifts, evaluation groups speaker transitions by the duration of silence between turns. On real-world videoconferencing data, MM-VAP reaches 84% hold/shift prediction accuracy versus 79% for a state-of-the-art audio-only baseline and maintains its advantage across all silence-duration groups. An ablation study identifies facial expression features as the most informative visual cue, and the suitability of automatic speech alignment for PTTM training is validated on telephone speech. The work constitutes the first comprehensive analysis of multimodal PTTMs.
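The summary does not spell out the model architecture. As an illustration only, the sketch below shows one way a multimodal PTTM of this kind could fuse per-frame audio and visual (facial expression, head pose, gaze) features and emit a hold/shift probability. The late-fusion design, module names, and feature dimensions are assumptions for the sketch, not the authors' MM-VAP implementation.

```python
# Hypothetical sketch of a multimodal predictive turn-taking model (PTTM).
# This is NOT the authors' MM-VAP architecture; feature dimensions, the
# late-fusion design, and the GRU head are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalPTTM(nn.Module):
    def __init__(self, audio_dim=80, face_dim=35, pose_dim=6, gaze_dim=4, hidden=128):
        super().__init__()
        # Separate encoders per modality, then concatenate (late fusion).
        self.audio_enc = nn.Linear(audio_dim, hidden)
        self.visual_enc = nn.Linear(face_dim + pose_dim + gaze_dim, hidden)
        # Temporal model over the fused per-frame features.
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        # Binary head: probability that the current speaker holds the turn
        # at the end of the window (vs. a shift to the other speaker).
        self.head = nn.Linear(hidden, 1)

    def forward(self, audio, face, pose, gaze):
        # audio: (B, T, audio_dim); face/pose/gaze: (B, T, *_dim)
        a = torch.relu(self.audio_enc(audio))
        v = torch.relu(self.visual_enc(torch.cat([face, pose, gaze], dim=-1)))
        x, _ = self.rnn(torch.cat([a, v], dim=-1))
        return torch.sigmoid(self.head(x[:, -1]))  # (B, 1) hold probability

# Example with dummy tensors: batch of 2 windows, 100 frames each.
model = MultimodalPTTM()
p_hold = model(torch.randn(2, 100, 80), torch.randn(2, 100, 35),
               torch.randn(2, 100, 6), torch.randn(2, 100, 4))
print(p_hold.shape)  # torch.Size([2, 1])
```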

📝 Abstract
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
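The abstract's evaluation groups hold/shift events by the duration of silence between turns instead of pooling them. A minimal sketch of that kind of analysis, assuming each event carries a predicted label, a ground-truth label, and the inter-turn silence gap in seconds (the bucket edges and field names below are illustrative, not the paper's):

```python
# Hypothetical sketch: hold/shift accuracy grouped by inter-turn silence
# duration. Bucket edges and event fields are illustrative assumptions.
from collections import defaultdict

def accuracy_by_gap(events, edges=(0.25, 0.5, 1.0)):
    """events: iterable of (predicted_label, true_label, gap_seconds)."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for pred, true, gap in events:
        # Assign the event to the first bucket whose upper edge exceeds the gap.
        for edge in edges:
            if gap < edge:
                key = f"<{edge}s"
                break
        else:
            key = f">={edges[-1]}s"
        buckets[key][0] += int(pred == true)
        buckets[key][1] += 1
    return {k: correct / total for k, (correct, total) in buckets.items()}

# Toy example: (predicted, true, silence gap in seconds) per hold/shift event.
events = [("hold", "hold", 0.1), ("shift", "hold", 0.3),
          ("shift", "shift", 0.7), ("hold", "hold", 1.4)]
print(accuracy_by_gap(events))
# {'<0.25s': 1.0, '<0.5s': 0.0, '<1.0s': 1.0, '>=1.0s': 1.0}
```

Reporting accuracy per bucket rather than a single pooled figure makes it visible whether a model's advantage holds for both short and long silences, which is the comparison the paper emphasizes.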
Problem

Research questions and friction points this paper is trying to address.

Enhancing turn-taking prediction using visual cues in human interaction
Improving accuracy by combining speech with facial and gaze features
Validating visual cues' vital role in multimodal turn-taking models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal PTTM combines speech and visual cues
Visual features improve turn-taking prediction accuracy
Facial expression contributes most to model performance